Preface


These are the proceedings of the twenty-second International Conference on Formal Methods in Computer-Aided 
Design (FMCAD), which was held in Trento, Italy from October 18 to October 21, 2022. FMCAD was first held
in 1996, and was a biennial conference until 2006, when the FMCAD and CHARME conferences merged into a
single FMCAD conference, and since then has been held annually. FMCAD 2022 is the twenty-second edition in the 
series, covering formal aspects of computer-aided system design including verification, specification, synthesis, and 
testing. It provides a leading forum for researchers in academia and industry to present and discuss groundbreaking
methods, technologies, theoretical results, and tools for reasoning formally about computing systems. 

The program of FMCAD 2022 consists of two tutorials, two invited talks, a student forum, and the main program 
consisting of presentations of 40 accepted peer-reviewed papers. 

The tutorial day featured two presentations: 


• On Applying Model Checking in Formal Verification by Hakan Hjort
• Verification of Distributed Protocols: Decidable Modeling and Invariant Inference by Oded Padon

and the main conference featured two invited talks:

• The seL4 Verification Journey: How Have the Challenges and Opportunities Evolved by June Andronick
• Why Do Things Go Wrong (or Right)? Applications of Causal Reasoning to Verification by Hana Chockler

FMCAD 2022 received 88 submissions out of which the committee decided to accept 40 for publication. 
Each submission received at least four reviews. The topics of the accepted papers include hardware and software 
verification, SAT, SMT, learning, synthesis, neural network verification, and others. Among the accepted papers, 
there are 31 regular papers (28 long and 3 short) and 9 tool/case study papers (6 long and 3 short). 

FMCAD 2022 hosted the tenth edition of the Student Forum, which has been held annually since 2013 and 
provides a platform for graduate students at any career stage to introduce their research to the FMCAD community.
The FMCAD Student Forum 2022 was organized by Mathias Preiner and featured short presentations of 21
accepted contributions. The proceedings provide a detailed description of the Student Forum and list all accepted
contributions. 

Organizing this event was made possible by the support of a large number of people and our sponsors. The 
program committee members and additional reviewers, listed on the following pages, did an excellent job providing 
detailed and insightful reviews. The reviews helped us build a strong program and helped the authors improve their 
submissions. We thank each and every one of them for dedicating their time and providing their expertise. We thank
Martin Jonáš for acting both as the web master and as the Sponsorship Chair, and Mathias Preiner for organizing this
year’s FMCAD Student Forum. We thank Georg Weissenbacher for his exceptional assistance in organizing
the event, for communicating to us the decisions of the steering committee, and for serving as the publication chair.

Holding a conference like FMCAD would not be feasible without the financial support of our sponsors. We 
would like to express our gratitude to our sponsors (in alphabetical order): Amazon Web Services, Cadence, Intel, 
Meta, and Synopsys. 

The conference proceedings are available as Open Access Proceedings published by TU Wien Academic Press, 
and through the IEEE Xplore Digital Library. Last but not least, we thank all authors who submitted their papers 
to FMCAD 2022 (accepted or not), and whose contributions and presentations form the core of the conference. 
We are grateful to everyone who presented their paper, gave a keynote or gave a tutorial. We thank all attendees 
of FMCAD for supporting the conference and making FMCAD an engaging and enjoyable event. 
Formal Methods in Computer-Aided Design 2022


The seL4 Verification Journey: How Have the 
Challenges and Opportunities Evolved 


June Andronick 
Proofcraft 
Kensington, Australia 
june.andronick@proofcraft.systems


Abstract—The formal verification journey of the seL4 microkernel is nearing two decades, and still has a busy roadmap for the
years ahead. It started as a research project aiming for a highly challenging problem with the potential of significant impact. Today, 
a whole ecosystem of developers, researchers, adopters and supporters are part of the seL4 community. With increasing uptake and 
adoption, seL4 is evolving, supporting more platforms, architectures, configurations, and features. This creates both opportunities 
and challenges: verification is what makes seL4 unique; as the seL4 code evolves, so must its formal proofs. With more than a 
million lines of formal, machine-checked proofs, seL4 is the most highly assured OS kernel, with proofs of an increasing number 
of properties (functional correctness, binary correctness, security—integrity and confidentiality—and system initialisation) and for 
an increasing number of hardware architectures: Arm (32-bit), x86 (64-bit) and RISC-V (64-bit), with proofs now starting for Arm 
(64-bit). In this talk we will reflect on the evolution of the challenges and opportunities the seL4 verification faced along its long, 
and continuing, journey. 
Why Do Things Go Wrong (or Right)?
Applications of Causal Reasoning to Verification 


Hana Chockler 
King’s College London 
London, UK 
hana.chockler@kcl.ac.uk 


Abstract—In this talk I will look at the connections between causality and learning from one side, and verification and synthesis 
from the other side. I will introduce the relevant concepts and discuss how causality and learning can help to improve the quality 
of systems and reduce the amount of human effort in designing and verifying systems. I will (briefly) introduce the theory of actual 
causality as defined by Halpern and Pearl. This theory turns out to be extremely useful in various areas of computer science due to 
a good match between the results it produces and our intuition. I will illustrate the definitions by examples from formal verification. 
I will also argue that active learning can be viewed as a type of causal discovery. Tackling the problem of reducing the human effort 
from the other direction, I will discuss ways to improve the quality of specifications and will focus in particular on synthesis. 
On Applying Model Checking in Formal Verification


Hakan Hjort 
hhjort@cadence.com


Abstract—The use of hardware model checking in the EDA industry is widespread and is now considered an essential part of verification.
While there are many papers, and books, about SAT, SMT and symbolic model checking, often very little is written about how
these methods can be applied. Choices made when modeling systems can have large impacts on applicability and scalability. There 
is generally no formal semantics defined for the hardware design languages, nor for the intermediate representations in common 
use. As unsatisfactory as it may be, industry conventions and behaviour exhibited by real hardware have instead been the guides. 
In this tutorial we will give an overview of some of the steps needed to apply hardware model checking in an EDA tool. We will 
touch on synthesis, hierarchy flattening, gate lowering, driver resolution, issues with discrete/synchronous time models, feedback 
loops and environment constraints, input rating and initialisation/reset. 

Design compilation, also known as elaboration and (quick) synthesis, is used to create a gate netlist from a hardware description 
language, commonly System Verilog. When done for implementation this often leverages any semantic freedom in order to create a 
more efficient implementation. In contrast, for verification we prefer to preserve all possible behaviour of any valid implementation 
choice. Assertions (properties) are normally handled similarly and translated to an automata representation that is then implemented 
by a gate netlist. 

The gate netlist is a hierarchical representation of gates and their connections (to wires). Removal of hierarchy can largely be done
by replicating the logic. Most gate types represent combinatorial functions; these can be kept as is, or lowered to a smaller subset of
gate functions (such as in And-Inverter graphs). The state-holding gates, (Flip-)Flops (edge sensitive) and Latches (level sensitive),
require some more care to model their (a)synchronous behaviour.

Special care is also needed to model Tri-state gates (and weak drivers), which can either drive a value on their output or hold it 
isolated. A Verilog wire uses a four-valued domain 0, 1, X, Z, where Z is high-impedance / not-driving. Resolving the drivers means
replacing the gates that drive a common wire with a model for the resolved logic value (and possibly checks for invalid/bad 
combinations). 
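
As an illustration of driver resolution, the following minimal Python sketch resolves several drivers of one wire over the four-valued domain. It assumes that all drivers have equal strength (weak drivers and strength levels are ignored), and the names `resolve2` and `resolve` are hypothetical, not part of any EDA tool.

```python
from functools import reduce


def resolve2(a: str, b: str) -> str:
    """Resolve two equal-strength drivers over the domain {0, 1, X, Z}."""
    if a == "Z":          # a is not driving, so b decides
        return b
    if b == "Z":          # b is not driving, so a decides
        return a
    if a == b:            # agreeing drivers (including X with X)
        return a
    return "X"            # conflicting drivers, or X against 0/1


def resolve(drivers) -> str:
    """Resolve any number of drivers; an undriven wire stays at Z."""
    return reduce(resolve2, drivers, "Z")
```

In this sketch an undriven wire resolves to Z and conflicting drivers yield X, which is the point where a check for invalid/bad combinations could be attached.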

It is common to have configurations, modes of operation and/or parts that should not be validated. Forcing some inputs to a fixed 
value is referred to as environment constraints. More complex constraints are instead normally considered part of the verification
setup and handled as SV assumptions. The fixed values can be propagated into the gates to remove parts that become constant or 
disconnected. 

For power and performance reasons it is common that designs are multi-clocked, or that clocks are gated (can be turned off and 
on). To have a global synchronous model for verification we need to reduce these multi-clock systems to a single global system 
(or tool) clock. This is often handled by mux-feedback added to the flops/latches along with logic generating the condition for the 
muxes. Inputs to the netlist may also have constraints on the rate/phase at which they can change. Rated inputs are free to take any value,
but only at certain points; clock generators follow a periodic pattern.

The use of a zero-delay timing model, meaning combinatorial gates output the function of their inputs without any delay, can give
rise to problems when there are feedback loops in the netlist. Such loops cause contradictions when a net would have taken two (or more) values
had there been some delay in propagating the values through gates. There are 5 kinds of loops that can occur: through flops (data and
clock), through latches (data and enable) and those only going through combinatorial gates. The ones going through flop data
are benign, as their effect is mediated by the clock. The others need to be ruled out, or handled by modeling. Introducing some
(fractional-)delay/steps seems an attractive approach, but establishing a bound on the number of steps needed is challenging (and for
some, no bound exists).

Initialisation, also referred to as reset, is commonly done by applying a sequence of values to a subset of inputs. This aims to get
the design from an arbitrary unknown state into a set of states from which it will have predictable behaviour. Some of the design
flops might have asynchronous reset, others can receive values on the data input from other flops and inputs, yet others might be
left uninitialised. Automating the computation of an (over-)approximation of the reset states will provide more information to the 
constructed model checking problem. 


https://doi.org/10.34727/2022/isbn.978-3-85448-053-2_3. This article is licensed under a Creative Commons Attribution 4.0 International License.




Verification of Distributed Protocols: Decidable 
Modeling and Invariant Inference 


Oded Padon 
VMware Research 
Verification of distributed protocols and systems, where both
the number of nodes in the systems and the state-space of 
each node are unbounded, is a long-standing research goal. 
In recent years, efforts around the Ivy verification tool [1]- 
[4] have pushed a strategy of modeling distributed protocols 
and systems in a new way that enables decidable deductive 
verification [5]-[8], i.e., given a candidate inductive invariant, 
it is possible to automatically check if it is inductive, and 
to produce a finite counterexample to induction in case it is 
not inductive. Complex protocols require quantifiers in both 
models and their invariants, including forall-exists quantifier 
alternations. Still, it is possible to obtain decidability by en- 
forcing a stratification structure on quantifier alternations, of- 
ten achieved using modular decomposition techniques, which 
are supported by Ivy. Stratified quantifiers lead not only to the- 
oretical decidability, but to reliably good solver performance 
in practice, which is in contrast to the typical instability of 
SMT solvers over formulas with complex quantification. 

Reliable automation of invariant checking and finite coun- 
terexamples open the path to automating invariant infer- 
ence [9]. An invariant inference algorithm can propose a 
candidate invariant, automatically check it, and get a finite 
counterexample that can be used to inform the next candi- 
date. For a complex protocol, this check would typically be 
performed thousands of times before an invariant is found, so 
reliable automation of invariant checking is a critical enabler. 
Recently, several invariant inference algorithms [9]-[18] have
been developed that can find complex quantified invariants for 
challenging protocols, including Paxos and some of its most 
intricate variants. 

In the tutorial I will provide an overview of Ivy’s prin- 
ciples and techniques for modeling distributed protocols in 
a decidable fragment of first-order logic. I will then survey 
several recently developed invariant inference algorithms for 
quantified invariants, and present one such algorithm in depth: 
Primal-Dual Houdini [13]. Primal-Dual Houdini is based on 
a new mathematical duality, and is obtained by deriving the 
formal dual of the well-known Houdini algorithm. As a result, 
Primal-Dual Houdini possesses an interesting formal symme- 
try between the search for proofs and for counterexamples. 
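
For readers unfamiliar with Houdini, the following is a minimal explicit-state sketch of the classical algorithm (not of Primal-Dual Houdini). It assumes a finite transition system given as Python lists, whereas a tool such as Ivy would discharge the corresponding checks with a solver; the function name `houdini` and its parameters are hypothetical.

```python
def houdini(candidates, init_states, transitions):
    """Return the names of the largest inductive subset of the candidates.

    candidates:  dict mapping a name to a predicate over states
    init_states: iterable of initial states
    transitions: list of (state, successor) pairs
    """
    # Keep only candidates that hold in every initial state.
    current = {name: p for name, p in candidates.items()
               if all(p(s) for s in init_states)}
    changed = True
    while changed:
        changed = False
        for s, t in transitions:
            # A transition is relevant if its source satisfies the conjunction.
            if all(p(s) for p in current.values()):
                broken = [name for name, p in current.items() if not p(t)]
                # (s, t) is a counterexample to induction for these conjuncts.
                for name in broken:
                    del current[name]
                    changed = True
    return set(current)
```

Each dropped conjunct is justified by a concrete counterexample to induction, which mirrors the propose, check, and refine loop described above.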
Abstract—The Student Forum at the International Conference
on Formal Methods in Computer-Aided Design (FMCAD)
gives undergraduate and graduate students the opportunity to 
introduce their research to the Formal Methods community and 
receive feedback. In 2022, the event took place in Trento, Italy. 
Twenty-one students were invited to give a short talk and present
a poster of their work. 


Since 2013, the FMCAD Student Forum has provided a platform
for undergraduate and graduate students at any career stage 
to present their research to the audience of the FMCAD 
conference. The 2022 edition of the FMCAD Student Forum 
follows the tradition of its predecessors, which took place in: 


• Portland, Oregon, USA in 2013 [1]
• Lausanne, Switzerland in 2014 [2]
• Austin, Texas, USA in 2015 [3] and 2018 [4]
• Mountain View, California, USA in 2016 [5]
• Vienna, Austria in 2017 [6]
• San Jose, California, USA in 2019 [7]
• Virtual in 2020 [8] and 2021 [9]


FMCAD 2022 hosted the tenth edition of the Student 
Forum. Graduate and undergraduate students were invited to 
submit two-page reports of their current research and ongoing 
work in the scope of the FMCAD conference. The Student 
Forum program committee reviewed 25 submissions out of 
which 21 were accepted. One submission was withdrawn 
by the student after acceptance resulting in 20 accepted 
submissions in total. The reviews were based on the overall 
quality, novelty of the work, its potential impact on the Formal 
Methods community, as well as the potential positive impact 
on the student to have the opportunity to participate in the 
forum. The accepted submissions covered a wide range of 
topics relevant to the FMCAD community, from foundational 
aspects of automated reasoning, to analysis and verification of 
software, hardware, and neural networks, as well as applica- 
tions of formal methods to security and biology. The following 
contributions have been accepted¹:


• Guy Amir: Verification-Driven Ensemble Selection
• Levente Bajczi: Axiomatic Analysis of Distributed Systems
• Mihaly Dobos-Kovacs: Lazy abstraction for time in eager CEGAR
• Bernhard Gstrein: Tuning the Learning of Circuit-Based Classifiers
• Ondřej Huvar: Symbolic Coloured Model Checking for HCTL
• Omri Isac: Proof Production for Neural Network Verification
• Dominik Klumpp: Commutativity in Concurrent Program Verification
• Pankaj Kumar Kalita: GAMBIT: An Interactive Playground for Concurrent Programs Under Relaxed Memory Models
• Hanna Lachnitt: Fine-Grained Reconstruction of cvc5 Proofs in Isabelle/HOL
• Tobias Paxian: Trading Accuracy For Smaller Cardinality Constraints
• Siddharth Priya: SEAURCHIN: Bounded Model Checking for Rust
• Sarah Sallinger: A Formalization of Heisenbugs and Their Causes
• Tiago Soares: Formal Verification of Algebraic Effects
• Dániel Szekeres: Lazy Abstraction for Probabilistic Systems
• Csanád Telbisz: Partial Order Reduction for Abstraction-Based Verification of Concurrent Software
• Muhammad Usama Sardar: Understanding Trust Assumptions for Attestation in Confidential Computing
• Daniella Vo: Formal Approach to Identifying Genes and Microbes Significant to Inflammatory Bowel Disease
• Amalee Wilson: Strategies for Parallel SMT Solving
• Suwei Yang: Incremental Weighted Sampling
• Tom Zelazny: On Optimizing Back-Substitution Methods for Neural Network Verification

¹Only first authors listed for brevity.


Unlike previous editions of the FMCAD student forum, 
which invited a subset of the FMCAD program committee 
to review student submissions, this year’s edition nominated 
an independent program committee (including some members 
of the FMCAD PC). The 2022 FMCAD Student Forum 
program committee consisted of Mathias Preiner (Chair), 
Armin Biere, Martin Blicha, Rayna Dimitrova, Rohit Dureja, 
Mathias Fleury, Aman Goel, Stéphane Graham-Lengrand, 
Antti Hyvärinen, Ahmed Irfan, Martin Jonáš, Daniela Kauf- 
mann, Daniel Larraz, Makai Mann, Alexander Nadel, Andres 
Noetzli, Mark Santolucito, Nestan Tsiskaridze, Tom van Dijk, 
and Florian Zuleger. 


We would like to thank the organizers of FMCAD, as well 
as the FMCAD Student Forum program committee, who have 
made the FMCAD Student Forum possible. Additionally, we 
are grateful to the student authors and their research mentors 
who have contributed their excellent work to the program. 
Abstract—We propose a method for verifying data-poisoning
robustness of the k-nearest neighbors (KNN) algorithm, which is 
a widely-used supervised learning technique. Data poisoning aims 
to corrupt a machine learning model and change its inference 
result by adding polluted elements into its training set. The 
inference result is considered n-poisoning robust if it cannot be 
changed by up-to-n polluted elements. Our method verifies n- 
poisoning robustness by soundly overapproximating the KNN 
algorithm to consider all possible scenarios in which polluted 
elements may affect the inference result. Unlike existing methods 
which only verify the inference phase but not the significantly 
more complex learning phase, our method is capable of verifying 
the entire KNN algorithm. Our experimental evaluation shows 
that the proposed method is also significantly more accurate than 
existing methods, and is able to prove the n-poisoning robustness 
of KNN for popular supervised-learning datasets. 


I. INTRODUCTION 


Data poisoning is an attack aimed to corrupt a machine 
learning model by polluting its training data, and thus affect 
the inference results for test data [33]. Prior work shows that 
even a small amount of polluted data, e.g., < 0.4% of the 
training set, is enough to affect the inference result [34], [6], 
[8]. Thus, verifying the robustness of the inference result in the 
presence of data poisoning is a practically important problem. 
Specifically, given a potentially-polluted training set T, and the
assumption that at most n elements in T are polluted, if we 
can prove that the inference result for a test input x remains 
unchanged by any n polluted elements in T, the inference 
result can still be considered trustworthy. 

This work is concerned with n-poisoning robustness of 
the k-nearest neighbors (KNN) algorithm, which is a widely 
used supervised learning technique in applications such as e- 
commerce, video recommendation, document categorization, 
and anomaly detection [18], [2], [41], [1], [30], [14], [27], 
[36], [44]. However, the verification problem is challenging 
for two reasons. First, KNN relies heavily on numerical anal- 
ysis, which involves a large number of non-linear arithmetic 
computations and complex statistical analysis techniques such 
as p-fold cross validation. They are known to be difficult for 
existing verification techniques. Second, even with a small n, 
there can be an extremely large number of possible scenarios 
in which polluted elements in T may affect the trained model 
and hence the inference result. 

Specifically, let $m = |T|$ be the number of elements in T
and $i \le n$ be the actual number of polluted elements in T; the
number of clean subsets of T (where polluted elements have
been removed) is $\binom{m}{i}$. Since $i = 1, \ldots, n$, the total number
of clean subsets of T is $\sum_{i=1}^{n} \binom{m}{i}$. Thus, it is impractical
to explicitly check, for each clean subset $T' \subset T$, whether
the inference result produced by the model trained using $T'$
remains the same as the inference result produced by the model
trained using T.
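
To give a sense of scale, the count above can be evaluated directly. The following is a small illustrative sketch; `count_clean_subsets` is a hypothetical helper name, and it follows the text in summing over i = 1, ..., n.

```python
from math import comb


def count_clean_subsets(m: int, n: int) -> int:
    """Number of ways to remove i elements from a set of size m, summed over i = 1..n."""
    return sum(comb(m, i) for i in range(1, n + 1))
```

For instance, with m = 50,000 training elements and only n = 3 possibly polluted elements, `count_clean_subsets(50_000, 3)` already exceeds 10^13, so retraining once per clean subset is out of reach.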

A practical approach, which is the one used by our method, 
is to soundly over-approximate the impact of all the clean sub- 
sets while analyzing the machine learning algorithm, following 
the abstract interpretation [9] paradigm for static program 
analysis. Here, the word soundly means that our method 
guarantees that, as long as the over-approximated inference 
result is proved robust, the actual inference result is robust. In 
addition to being sound, our method is efficient in that, instead 
of training a model for each clean subset T’, it combines all 
clean subsets together to compute a set of abstract models in 
a single pass. 

For KNN, in particular, each model corresponds to an 
optimal value of the parameter K, indicating how many 
neighbors in T are used to infer the output label of a test input 
x. Thus, our method computes an over-approximated set of K 
values, denoted KSet. Then, it over-approximates the KNN’s
inference phase, to check if the output label of x remains the
same for all K ∈ KSet. If the output label remains the same,
the inference result for x is considered robust against any of 
the possible n-poisoning attacks of the training set T. 

To the best of our knowledge, our method is the first method 
that can soundly verify n-poisoning robustness of the entire 
KNN algorithm, consisting of both the learning (K parameter 
tuning) phase and the inference phase. In the literature, there 
are two closely related prior works. The first one, by Jia et 
al. [21], aims to verify the robustness of KNN’s inference 
phase only; in other words, they require the K value to be 
fixed and given, with the implicit assumption that the optimal 
K value is not affected by data poisoning. Unfortunately, 
this is not a valid assumption, as shown by the motivating 
examples presented in Section II. Furthermore, by fixing 
the K value, the more challenging part of the verification 
problem has been sidestepped, which is verifying the p-fold 
cross validation during KNN’s learning phase. How to over- 
approximate KNN’s learning phase soundly and efficiently is 
a main contribution of our work. 

The other closely-related prior work, by Drews et al. [12], 
aims to prove robustness of a different machine learning 
technique, namely the decision tree learning (DTL) algorithm.
Since DTL differs significantly from KNN in that it
relies primarily on logical operations (such as And, Or, and 
Negation) as opposed to nonlinear arithmetic computations, 
their verification method relies on a fundamentally different 
technique (symbolic path exploration) from ours, and is not 
directly applicable to KNN. 

At a high level, our verification method works as follows. 
Given a tuple (T,n,x), where T is the potentially-polluted 
training set, n is the maximum number of polluted elements 
in T, and x is a test input, our method tries to prove that,
no matter which of the $i \le n$ elements in T are polluted, the
KNN’s inference result for x remains the same. By default,
the training set T corresponds to a model M, whose inference
result for x is y = M(x). Using an overapproximated analysis,
our method checks if the output label $y' = M'(x)$ produced
by a model $M'$ corresponding to any clean subset $T' \subset T$
remains the same as the default label y = M(x). If that is 
the case, our method verifies the robustness of the inference 
result. Otherwise, it remains inconclusive. 

We have implemented our method and conducted experi- 
mental evaluation using six popular machine learning datasets, 
which include both small and large datasets. The small datasets 
are particularly useful in evaluating the accuracy of the ver- 
ification result because, when datasets are small, even the 
baseline approach of explicitly enumerating all clean subsets 
$T' \subset T$ is fast enough to complete and obtain the ground
truth. The large datasets, some of which have more than 50,000 
training data elements and thus are well beyond the reach of 
the baseline enumeration approach, are useful in evaluating the 
efficiency of our method. For comparison, we also evaluated 
the method of Jia et al. [21] with fixed K values. 

Our experimental results show that, for KNN’s inference 
phase only, our method is significantly more accurate than 
the method of Jia et al. [21] and as a result, proves robust- 
ness for many more cases. Overall, our method is able to 
achieve similar empirical accuracy as the ground truth on small 
datasets, while being reasonably accurate on large datasets and 
several orders of magnitude faster than the baseline method.
In particular, our method is the only one that can finish the 
complete verification of 10,000 test inputs for a training dataset 
with more than 50,000 elements within half an hour. 

To summarize, this paper has the following contributions: 


• We propose the first method for soundly verifying data-poisoning
  robustness of the entire KNN algorithm, consisting
  of both the learning phase and the inference phase.
• We evaluate the method on popular supervised learning
  datasets to demonstrate its advantages over both the
  baseline and a state-of-the-art technique.


The remainder of this paper is organized as follows. First, 
we review the definition of n-poisoning robustness and the ba- 
sics of the k-nearest neighbors (KNN) algorithm in Section II. 
Then, we present the intuition and overview of our method in 
Section III. Next, we present our method for verifying the 
KNN learning phase in Section IV and verifying the KNN 
inference phase in Section V. We present our experimental
results in Section VI, review the related work in Section VII,
and give our conclusions in Section VIII. 


II. BACKGROUND 
A. Data-Poisoning Robustness 


Let L be a supervised learning algorithm that takes a set
T = {(x, y)} of training data elements as input and returns
a learned model M = L(T) as output. Within each data
element, input $x \in \mathcal{X} \subseteq \mathbb{R}^D$ is a D-dimensional real-valued
feature vector, and output $y \in \mathcal{Y} \subseteq \mathbb{N}$ is a natural number that
represents a class label. The model is a prediction function
$M : \mathcal{X} \to \mathcal{Y}$ that maps a test input $x \in \mathcal{X}$ to its class label
$y \in \mathcal{Y}$. Following Drews et al. [12], we define data-poisoning
robustness as follows.

a) n-Poisoning Model: Let T be a potentially-polluted
training set, m = |T| be the total number of elements in
T, and n be the maximum number of polluted elements in
T. Assuming that we do not know which elements in T are
polluted, the set of all possible scenarios is captured by the set
of clean subsets, denoted $\Delta_n(T) = \{T' \subseteq T : |T \setminus T'| \le n\}$.
In other words, each $T'$ may be the result of removing all of
the polluted elements from T.

b) n-Poisoning Robustness: We say the inference result
y = M(x) for a test input $x \in \mathcal{X}$ is robust to n-poisoning
attacks of T if and only if, for all $T' \in \Delta_n(T)$ and the
corresponding model $M' = L(T')$, we have $M'(x) = M(x)$.
In other words, the predicted label remains the same.

For example, when T = {a, b, c, d} and n = 1, the clean
subsets are $T_1 = \{b, c, d\}$, $T_2 = \{a, c, d\}$, $T_3 = \{a, b, d\}$ and
$T_4 = \{a, b, c\}$, which correspond to models $M_1$ to $M_4$ and
inference results $y_1 = M_1(x)$, $y_2 = M_2(x)$, $y_3 = M_3(x)$ and
$y_4 = M_4(x)$. Let M be the default model obtained by T and
y = M(x) be the default output label. The inference result is
1-poisoning robust if and only if $y_1 = y_2 = y_3 = y_4 = y$.
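
The definition can be checked directly by enumeration, which is only feasible for very small T. The following sketch assumes that the learner is passed in as a function `learn` mapping a training list to a prediction function, and that n is smaller than |T|; `majority_learner` is a toy stand-in for L (not KNN), and both names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def majority_learner(train):
    """Toy stand-in for L: always predicts the most frequent training label."""
    counts = Counter(label for _, label in train)
    # Assumption: ties are broken towards the smallest label.
    best = min(counts, key=lambda lab: (-counts[lab], lab))
    return lambda x: best


def is_n_poisoning_robust(train, n, x, learn):
    """True iff every clean subset (1..n removals) yields the default label for x."""
    default_label = learn(train)(x)
    for i in range(1, n + 1):
        for removed in combinations(range(len(train)), i):
            dropped = set(removed)
            subset = [e for j, e in enumerate(train) if j not in dropped]
            if learn(subset)(x) != default_label:
                return False  # some clean subset changes the prediction
    return True
```

With three elements labeled "a" and one labeled "b", the toy learner is 1-poisoning robust for any input, but not 3-poisoning robust, since removing all three "a" elements flips the label.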

This robustness definition has two advantages. First, when- 
ever the inference result for a test input x is proved to 
be robust, it provides a strong guarantee of trustworthiness. 
Second, the verification procedure does not require the actual 
label of x to be known, which means it is applicable to 
unlabeled test data, which are common in practice. 


B. k-Nearest Neighbors (KNN) 


KNN is a supervised learning algorithm with two phases. 
During the learning phase, the training set T is used to
compute the optimal value of the parameter K, which indicates
how many neighbors in T to consider when deciding the
output label for a test input x. During the inference phase,
given an unlabeled test input $x \in \mathcal{X}$, the K nearest neighbors
of x in T are used to compute the most frequent label, which 
is returned as the output label of x. 

The distance between data elements, which is used to find 
the nearest neighbors of x in T, is defined on the input 
feature vectors. The most widely used metric is the Euclidean 
distance: given two elements $x_a, x_b \in \mathcal{X} \subseteq \mathbb{R}^D$, where D
is the dimension of the input feature vector, the Euclidean
distance is $\sqrt{\sum_{i=1}^{D} (x_a[i] - x_b[i])^2}$.
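
The inference phase can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation evaluated in this paper; it assumes that training elements are (feature vector, label) pairs and that ties between labels are broken towards the smallest label, and the function names are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal dimension D."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train, x, k):
    """Most frequent label among the k training elements nearest to x."""
    neighbors = sorted(train, key=lambda e: euclidean(e[0], x))[:k]
    counts = Counter(label for _, label in neighbors)
    top = max(counts.values())
    # Assumption: ties between labels are broken towards the smallest label.
    return min(label for label, c in counts.items() if c == top)
```

Sorting the whole training set keeps the sketch short; it costs O(m log m) per query, which is acceptable for illustration only.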
(a) polluted set (K=3) (b) clean set (K=3)

Fig. 1. Example of direct influence of the polluted data.

(a) polluted dataset (K=3) (b) clean dataset (K=5)

Fig. 2. Example of indirect influence of polluted data.


The optimal K value is the one that has the smallest average 
misclassification error on the training set T. The misclassifi- 
cation error is computed using p-fold cross validation, which 
randomly divides T into p groups of approximately equal size 
and, for each group, computes the misclassification error by
treating this group as the test set and the union of all the other
p - 1 groups as the training set. Finally, the misclassification
errors of the individual groups are used to compute the average 
misclassification error among all p groups. 
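
The learning phase described above can be sketched as follows. This is an illustrative sketch under several assumptions that are not taken from this paper: the candidate K values are supplied by the caller, the random split uses a fixed seed, groups are formed round-robin after shuffling, p is at most |T|, and the first candidate with the smallest average error wins. The names `choose_k` and `knn_predict` are hypothetical.

```python
import math
import random
from collections import Counter


def knn_predict(train, x, k):
    """Most frequent label among the k nearest neighbors (ties: smallest label)."""
    neighbors = sorted(train, key=lambda e: math.dist(e[0], x))[:k]
    counts = Counter(label for _, label in neighbors)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


def choose_k(train, candidate_ks, p, seed=0):
    """Pick the K with the smallest average p-fold misclassification error."""
    data = list(train)
    random.Random(seed).shuffle(data)           # random division of T
    folds = [data[i::p] for i in range(p)]      # p groups of near-equal size
    best_k, best_err = None, float("inf")
    for k in candidate_ks:
        fold_errors = []
        for i, test_fold in enumerate(folds):
            # Union of the other p - 1 groups serves as the training set.
            rest = [e for j, f in enumerate(folds) if j != i for e in f]
            wrong = sum(knn_predict(rest, x, k) != y for x, y in test_fold)
            fold_errors.append(wrong / len(test_fold))
        err = sum(fold_errors) / p              # average over all p groups
        if err < best_err:
            best_k, best_err = k, err
    return best_k
```

Because the chosen K depends on every element of T through the fold errors, removing even one element can change it, which is the effect that the verification of the learning phase has to account for.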


III. THE INTUITION AND OVERVIEW OF OUR METHOD


We first present the intuition behind our method, and then 
give an overview of the method in contrast to the baseline. 


A. Two Ways of Affecting the Inference Result 


In general, there are two ways in which polluted training 
elements in T affect the inference result. One of them, called 
direct influence, is to change the neighbors of x and thus their 
most frequent label. The other one, called indirect influence, 
is to change the parameter K itself. 

Fig. 1 shows how polluted data may change the test input’s 
neighbors and thus the inference result. Here, the gray dot 
represents the test input x, while the orange and blue dots 
represent elements in the training set T. There is only one 
polluted element, which is an orange dot marked in Fig. 1 
(a). This element no longer exists in Fig. 1 (b). Assume that 
the optimal value for the parameter K is 3. For the clean set 
shown in Fig. 1 (b), the result is ‘blue’ since two of the three 
nearest neighbors of the test input x are blue. For the polluted 
set shown in Fig. 1 (a), however, the result is ‘orange’ since 
two of the three nearest neighbors are orange. 


Fig. 2 shows how polluted data may change the inference 
result by changing the optimal value of the parameter K. In 
this case, the polluted element in Fig. 2 (a) is far away from 
the test input x. However, its presence changes the optimal 
value of the parameter K during the p-fold cross validation 
phase. While the K value for the clean set is 5, the K value 
for the polluted set is 3. As a result, the most frequent label of 
the neighbors is changed from ‘blue’ in Fig. 2 (b) to ‘orange’ 
in Fig. 2 (a). 

These two examples highlight the importance of analyzing 
both the learning phase and the inference phase of the KNN 
algorithm. Otherwise, the verification result may be unsound, 
which is the case for Jia et al. [21] due to their implicit 
(and incorrect) assumption that K is not affected by polluted 
elements in T. In contrast, our method soundly verifies both 
phases of the KNN algorithm. 

While verifying the KNN inference phase itself is already 
challenging, verifying the KNN learning phase is even more 
challenging, since it uses p-fold cross validation to compute 
the optimal K value. 


B. Overview of Our Method 


Before presenting our method, we present a conceptually
simple but computationally expensive baseline method. It
helps explain why the verification problem is challenging.


Algorithm 1: Baseline method KNN_Verify(T, n, x).

for each T' ∈ Δ_n(T) do
    K' ← KNN_learn(T')
    y' ← KNN_predict(T', K', x)
    YSet ← YSet ∪ {y'}
end
robust ← (|YSet| = 1)


a) The Baseline Method: This method checks
whether the inference result remains the same for all possible
ways in which the training set may be polluted. Algorithm 1 shows
the pseudocode, where T is the training set, n is the maximal
polluted number, and x is a test input. For each clean subset
$T' \in \Delta_n(T)$, the parameter K is computed using the standard
KNN_learn subroutine and then used to predict the label of
x using the standard KNN_predict subroutine. Here, YSet
stores the set of predicted labels; thus, |YSet| = 1 means the
prediction result is always the same (and hence robust).

The baseline method is both sound and complete, and thus
may be used to obtain the ground truth when the size of the
dataset is small enough. However, it is not a practical solution
for large datasets because of the combinatorial blowup: it has
to explicitly enumerate all $|\Delta_n(T)| = \sum_{i=0}^{n} \binom{m}{i}$ cases. Even
for m = 100 and n = 5, for example, the number becomes
as large as $8 \times 10^7$. For realistic datasets, often with tens of
thousands of elements, the baseline method would not finish
in a billion years.
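
The size of this search space is easy to check numerically. The sketch below, which assumes that m denotes the number of elements in T, counts the subsets obtained by removing at most n elements and enumerates them for a tiny set; the helper names are hypothetical.

```python
from itertools import combinations
from math import comb


def num_clean_subsets(m, n):
    """Number of ways to remove at most n elements from a set of size m."""
    return sum(comb(m, i) for i in range(n + 1))


def clean_subsets(train, n):
    """Yield every subset of `train` obtained by removing at most n elements."""
    for i in range(n + 1):
        for removed in combinations(range(len(train)), i):
            gone = set(removed)
            yield [s for j, s in enumerate(train) if j not in gone]


print(num_clean_subsets(100, 5))  # 79375496, roughly 8 x 10^7
```

Explicit enumeration with `clean_subsets` is feasible only for very small inputs, which is exactly the limitation of the baseline.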

b) The Proposed Method: Our method avoids enumerating
the individual scenarios in $\Delta_n(T)$. As shown in
Algorithm 2, it first analyzes, in a single pass, the KNN's learning
phase while simultaneously considering the impact of up-to-n


Algorithm 2: Our method abs_KNN_Verify(T, n, x).

KSet ← abs_KNN_learn(T, n)
YSet ← abs_KNN_predict(T, n, KSet, x)
robust ← (|YSet| = 1)


Algorithm 3: Subroutine for the baseline: KNN_learn(T).

Divide T into p groups {G_i} of equal size;
for each K ∈ CandidateKset do
    for each group G_i do
        errCnt_i^K = 0
        for each sample (x, y) ∈ G_i do
            errCnt_i^K++ when (KNN_predict(T \ G_i, K, x) ≠ y);
        error_i^K = errCnt_i^K / |G_i|
    error^K = (1/p) Σ_{i=1}^{p} error_i^K
return the K value with the smallest error^K


polluted elements in T. The result of this over-approximated
analysis is a superset of possibly-optimal K values, stored in
KSet. Details of the subroutine abs_KNN_learn are presented
in Section IV.

Then, for each K ∈ KSet, our method analyzes the KNN's
inference phase while considering all possible ways in which
up-to-n elements in T may have been polluted. The result
of this over-approximated analysis is a superset of possible
output labels, denoted YSet. We say the inference result for
x is robust if the cardinality of YSet is 1; that is, the label of x
remains the same regardless of how T may have been polluted.
Details of the subroutine abs_KNN_predict are presented in
Section V.


IV. ANALYZING THE KNN LEARNING PHASE 


To understand why soundly analyzing the KNN learning
phase is challenging, we need to compare our method with
the original subroutine, KNN_learn, shown in
Algorithm 3, which computes the optimal K value using
p-fold cross validation. Note that both the value of p and the
CandidateKset are hyper-parameters of the KNN algorithm
itself, not part of the verification method. In practice, they
typically do not depend on the size of T (see Section II-B for
a detailed explanation).


A. The Algorithm 


In contrast, our method shown in Algorithm 4 computes an
over-approximated set of K values. The input consists of the
training set T and the maximal polluted number n, while the
output KSet is a superset of the optimal K values.

Inside Algorithm 4, our method first computes the lower and
upper bounds of the misclassification error for each K value,
by considering the best case ($\mathit{errorLB}^K$) and the worst case
($\mathit{errorUB}^K$) when up-to-n elements in T are polluted.

After computing the interval $[\mathit{errorLB}^K, \mathit{errorUB}^K]$ for
each K value, it computes minUB, which is the minimal
upper bound among all K values.
Algorithm 4: Subroutine KSet = abs_KNN_learn(T, n).

Divide T into p groups {G_i} of equal size;
for each K ∈ CandidateKset do
    for each group G_i do
        errCntLB_i^K = errCntUB_i^K = 0;
        for each sample (x, y) ∈ G_i do
            errCntLB_i^K++ if (abs_KNN_cannot_obtain_correct_label(T \ G_i, n, K, x, y) == True);
            errCntUB_i^K++ if (abs_KNN_may_obtain_wrong_label(T \ G_i, n, K, x, y) == True);
        errorLB_i^K = max{0, (errCntLB_i^K - n) / (|G_i| - n)};
        errorUB_i^K = min{errCntUB_i^K / (|G_i| - n), 1};
    errorLB^K = (1/p) Σ_{i=1}^{p} errorLB_i^K;
    errorUB^K = (1/p) Σ_{i=1}^{p} errorUB_i^K;
Let minUB = the smallest errorUB^K for all K;
KSet = {K | errorLB^K ≤ minUB};


Fig. 3. Example of comparing the error bounds (vertical axis: Error; dashed line: minUB).


Then, by comparing minUB with $\mathit{errorLB}^K$ for each
K, it over-approximates the set of K values that may
become the optimal K value for some $T' \in \Delta_n(T)$.

Here, the intuition is that, by excluding K values that are
definitely not the optimal K for any $T' \in \Delta_n(T)$ (they
are the ones whose $\mathit{errorLB}^K$ is larger than minUB), we
obtain a sound over-approximation in KSet.

a) Example for minUB: Fig. 3 shows an example,
where each vertical bar represents the interval
$[\mathit{errorLB}^K, \mathit{errorUB}^K]$ of a candidate K value, and the blue
dashed line represents minUB. The selected K values are
those corresponding to the blue bars, since their $\mathit{errorLB}^K$
values do not exceed minUB. The K values corresponding to the
gray bars are dropped, since they definitely cannot have the
smallest misclassification error.

b) The Soundness Guarantee: To understand why the
KSet computed in this manner is an over-approximation,
assume that $\mathit{minUB} = \mathit{errorUB}^{K'}$ for some value K'. We
now explain why K cannot be the optimal value (with the
smallest error) when $\mathit{errorLB}^K > \mathit{minUB}$. Let the actual
errors be $\mathit{error}^K \in [\mathit{errorLB}^K, \mathit{errorUB}^K]$ and
$\mathit{error}^{K'} \in [\mathit{errorLB}^{K'}, \mathit{errorUB}^{K'}]$. Since we have
$\mathit{errorLB}^K > \mathit{errorUB}^{K'}$, we know $\mathit{error}^K$ must be larger than
$\mathit{error}^{K'}$. Therefore, K cannot have the smallest error.
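
The selection step can be illustrated with a short Python sketch. The function name `select_k_set` and the interval values are hypothetical; the comparison keeps every K whose lower bound does not exceed minUB, matching the exclusion rule stated above.

```python
def select_k_set(bounds):
    """Keep every K whose lower error bound does not exceed the smallest upper bound.

    `bounds` maps each candidate K to a pair (error_lb, error_ub).
    """
    min_ub = min(ub for _, ub in bounds.values())
    return {k for k, (lb, _) in bounds.items() if lb <= min_ub}


# Hypothetical error intervals for four candidate K values.
bounds = {1: (0.10, 0.30), 3: (0.05, 0.20), 5: (0.15, 0.25), 7: (0.22, 0.40)}
print(sorted(select_k_set(bounds)))  # [1, 3, 5]
```

Here minUB is 0.20 (attained by K = 3), so K = 7 is dropped because its lower bound 0.22 already exceeds every error that K = 3 can have.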


To compute the interval $[\mathit{errorLB}^K, \mathit{errorUB}^K]$, we add
up the misclassification errors over the elements $(x, y) \in G_i$,
where $x \in X$ is the input and $y \in Y$ is the (correct) label.
For each element (x, y), there is a misclassification error if
y differs from the predicted label.

Here, $\mathit{errCntLB}_i^K$ corresponds to the best case scenario:
removing n elements from T in such a way that prediction
becomes as correct as possible. In contrast, $\mathit{errCntUB}_i^K$
corresponds to the worst case scenario: removing n elements
from T in such a way that prediction becomes as incorrect
as possible. These two error counts are computed by two
subroutines, which will be presented later in this section.

To convert $\mathit{errCntLB}_i^K$ and $\mathit{errCntUB}_i^K$ to error rates,
we consider removing n misclassified elements when computing
the lower bound $\mathit{errorLB}_i^K$, and removing n correctly
classified data elements when computing the upper bound
$\mathit{errorUB}_i^K$. We assume $n < |G_i|$, which is a reasonable
assumption in practice.
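
The conversion from error counts to error rates for a single group can be written as a small Python helper. The function name and the example counts are hypothetical; the two formulas follow the corresponding lines of Algorithm 4.

```python
def fold_error_bounds(err_cnt_lb, err_cnt_ub, group_size, n):
    """Convert the two error counts of one group into an error-rate interval."""
    if not 0 <= n < group_size:
        raise ValueError("requires 0 <= n < |G_i|")
    # Best case: n misclassified elements are among the removed ones.
    lb = max(0.0, (err_cnt_lb - n) / (group_size - n))
    # Worst case: n correctly classified elements are among the removed ones.
    ub = min(err_cnt_ub / (group_size - n), 1.0)
    return lb, ub


print(fold_error_bounds(5, 12, 50, 2))  # (0.0625, 0.25)
```

For a group of 50 elements and n = 2, the lower bound is (5 - 2) / 48 and the upper bound is 12 / 48.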

To explain the subroutines abs_KNN_cannot_obtain_correct_label
and abs_KNN_may_obtain_wrong_label, we need to introduce
some notation, including the label counter and the removal strategy.


B. The Label Counter 


Nearest Neighbors $T_x^K$. Let $T_x^K$ be the subset of T consisting
of the K nearest neighbors of x. For example, given T =
{((0.1, 0.1), $l_2$), ((1.1, 0.1), $l_1$), ((0.1, 1.1), $l_1$), ((2.1, 3.1), $l_3$),
((3.3, 3.1), $l_3$)}, test input x = (1.1, 1.1), and K = 3, the set
is $T_x^3$ = {((0.1, 0.1), $l_2$), ((1.1, 0.1), $l_1$), ((0.1, 1.1), $l_1$)}. Here,
we assume each neighbor has two real-valued input features
and three possible output class labels $l_1$ to $l_3$.

Label Counter $\mathcal{E}(T_x^K)$. Given any dataset Z, including
$T_x^K$, we use $\mathcal{E}(Z) = \{ (l_i : \#l_i) \}$ to represent the label
counts, where $l_i$ is a class label, and $\#l_i \in \mathbb{N}$ is the number
of elements in Z that have the label $l_i$. For example, given $T_x^3$
above, we have $\mathcal{E}(T_x^3) = \{(l_1 : 2), (l_2 : 1)\}$, meaning it has
two elements with label $l_1$ and one with label $l_2$.

Most Frequent Label $\mathit{Freq}(\mathcal{E}(T_x^K))$. Given a label counter
$\mathcal{E}$, the most frequent label, denoted $\mathit{Freq}(\mathcal{E})$, is the label
with the largest count. Similarly, we can define the second
most frequent label. Thus, the KNN inference phase can be
described as computing $\mathit{Freq}(\mathcal{E}(T_x^K))$ for the training set T,
test input x, and K value.

Tie-Breaker $\mathbf{1}_{(l_i < l_j)}$. If two labels have the same frequency,
the KNN algorithm may use their lexicographic order as a
tie-breaker to ensure that $\mathit{Freq}(\mathcal{E})$ is unique. Let < be the order
relation; then $(l_i < l_j)$ must be either true or false. Thus, we define
an indicator function, $\mathbf{1}_{(l_i < l_j)}$, to return the numerical value 1
(or 0) when $(l_i < l_j)$ is true (or false).
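
The notation introduced so far maps directly onto a few Python helpers. In this sketch, labels are represented as strings and compared lexicographically; the helper names `nearest_neighbors`, `label_counter`, and `freq` are assumptions of the example.

```python
import math
from collections import Counter


def nearest_neighbors(train, x, k):
    """Return the k elements of `train` closest to x in Euclidean distance."""
    return sorted(train, key=lambda item: math.dist(item[0], x))[:k]


def label_counter(data):
    """Count how many elements of `data` carry each label."""
    return Counter(label for _, label in data)


def freq(counter):
    """Most frequent label; equal counts go to the lexicographically smaller label."""
    return min(counter, key=lambda label: (-counter[label], label))


# The running example from the text.
T = [((0.1, 0.1), "l2"), ((1.1, 0.1), "l1"), ((0.1, 1.1), "l1"),
     ((2.1, 3.1), "l3"), ((3.3, 3.1), "l3")]
T3 = nearest_neighbors(T, (1.1, 1.1), 3)
print(freq(label_counter(T3)))  # l1
```

For the running example, the label counter of the three nearest neighbors holds two votes for l1 and one for l2, so the most frequent label is l1.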


C. The Removal Strategy 


The removal strategy is an abstract way of modeling the 
impact of polluted data elements. In contrast, the removal set 
is a concrete way of modeling the impact. 

The Removal Set. Given a dataset Z, the removal set
$R \subseteq Z$ can be any subset of Z. Given $T_x^3$ above, for example,
there are 6 possible removal sets: $R_1 = \{(x_1, y_1)\}$, $R_2 = \{(x_2, y_2)\}$,
$R_3 = \{(x_3, y_3)\}$, $R_4 = \{(x_1, y_1), (x_2, y_2)\}$,
$R_5 = \{(x_1, y_1), (x_3, y_3)\}$, and $R_6 = \{(x_2, y_2), (x_3, y_3)\}$. In
particular, $R_1$ means removing element $(x_1, y_1)$ from Z.

The Removal Strategy. The removal strategy is simply the
label counter of a removal set R, denoted $S = \mathcal{E}(R)$. In the
above example, the six removal sets correspond to only four
removal strategies $S_1 = \{(l_1 : 1)\}$, $S_2 = \{(l_2 : 1)\}$,
$S_3 = \{(l_1 : 1), (l_2 : 1)\}$, and $S_4 = \{(l_1 : 2)\}$. In particular, $S_2$
means removing an element labeled $l_2$; however, it does not
say which of the elements with that label is removed. Thus, it captures any
removal set that has the same label counter.

The Strategy Size. Let the removal strategy be denoted
$S = \{ (l_i : \#l_i) \}$. We define its size as $\|S\| = \sum_{(l_i : \#l_i) \in S} \#l_i$, which
is the total number of removed elements. For $S_1 = \{(l_1 : 1)\}$,
$S_2 = \{(l_2 : 2)\}$, and $S_3 = \{(l_1 : 1), (l_3 : 3)\}$, the strategy size
would be $\|S_1\| = 1$, $\|S_2\| = 2$, and $\|S_3\| = 4$.
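
The relation between removal sets and removal strategies can be checked by brute force on the running example. In the sketch below, the function names are hypothetical and a strategy is encoded as a sorted tuple of (label, count) pairs so that it can be stored in a set.

```python
from collections import Counter
from itertools import combinations


def removal_strategies(neighbors, sizes):
    """Map every removal set of the given sizes to its label counter.

    Returns (number of removal sets, set of distinct strategies).
    """
    num_sets, strategies = 0, set()
    for size in sizes:
        for removal_set in combinations(neighbors, size):
            num_sets += 1
            counter = Counter(label for _, label in removal_set)
            strategies.add(tuple(sorted(counter.items())))
    return num_sets, strategies


def strategy_size(strategy):
    """Total number of removed elements, written ||S|| in the text."""
    return sum(count for _, count in strategy)


T3 = [((0.1, 0.1), "l2"), ((1.1, 0.1), "l1"), ((0.1, 1.1), "l1")]
num_sets, strategies = removal_strategies(T3, sizes=(1, 2))
print(num_sets, len(strategies))  # 6 4
```

The six removal sets of size one or two collapse into four strategies, because the two elements labeled l1 are interchangeable at the level of label counts.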

In the context of the abstract interpretation paradigm [9],
the removal sets can be viewed as the concrete domain while
the removal strategies can be viewed as the abstract domain.
Focusing on the abstract domain during verification makes our
method more efficient. Let $|\mathcal{L}|$ be the total number of class
labels, which is often small in practice (e.g., 2 or 10). Since the
count of each label in a removal set is at most n, the number
of removal strategies is at most $\sum_{i=0}^{n} \binom{|\mathcal{L}|+i-1}{i}$. This can be
exponentially smaller than the number of possible removal
sets, which is $\sum_{i=0}^{n} \binom{m}{i}$.


D. Misclassification Error Bounds 


Using the notation defined so far, we present our method
for computing the lower and upper bounds, $\mathit{errCntLB}_i^K$ and
$\mathit{errCntUB}_i^K$, as shown in Algorithms 5 and 6.

Both bounds rely on computing $T_x^{K+n}$, the K + n nearest neighbors
of x in T, and the label counter $\mathcal{E}(T_x^{K+n})$.

• The first subroutine checks whether it is impossible, even
  after removing up-to-n elements from T, that the correct
  label y becomes the most frequent label.

• The second subroutine checks whether it is possible, after
  removing up-to-n elements from T, that some wrong
  label becomes the most frequent label.

Before explaining the details, we present Theorem 1, which
states the correctness of these checks. It says that, to model the
impact of all subsets $T' \in \Delta_n(T)$, we only need to analyze
the (K + n) nearest neighbors of x, stored in $T_x^{K+n}$.


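Theorem 1. $\forall T' \in \Delta_n(T)$, we have $\mathit{Freq}(\mathcal{E}((T')_x^K)) \in
\{ \mathit{Freq}(\mathcal{E}(T_x^{K+n}) \setminus S) \mid S \subseteq \mathcal{E}(T_x^{K+n}), \|S\| \le n \}$.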


For brevity, we omit the detailed proof. Instead, we give the
intuition behind the proof as follows:

• For each clean training subset $T' \in \Delta_n(T)$, we can
  always find a label counter $\mathcal{E}(T_x^{K+i})$ and a removal
  strategy $S \subseteq \mathcal{E}(T_x^{K+i})$, where $\|S\| = i \le n$, satisfying
  $\mathcal{E}(T_x^{K+i}) \setminus S = \mathcal{E}((T')_x^K)$.

• If we want to check all the predicted labels of x generated
  by all $T' \in \Delta_n(T)$, we need to search through all of
  $\mathcal{E}(T_x^{K}), \mathcal{E}(T_x^{K+1}), \ldots, \mathcal{E}(T_x^{K+n})$, which is expensive
  when n is large.


Algorithm 5: Subroutine used in our Algorithm 4: flag =
abs_KNN_cannot_obtain_correct_label(T, n, K, x, y).

Let E(T_x^{K+n}) be the label counter of T_x^{K+n};
Define removal strategy S = { (y' : #y' - #y + 1_{y'<y}) | (y' : #y') ∈ E(T_x^{K+n}), y' ≠ y, #y' ≥ #y };
return (||S|| > n);


Algorithm 6: Subroutine used in our Algorithm 4: flag =
abs_KNN_may_obtain_wrong_label(T, n, K, x, y).

Let E(T_x^{K+n}) be the label counter of T_x^{K+n};
Let y' be the most frequent label in E(T_x^{K+n}) except the label y;
Define removal strategy S = { (y : max(0, #y - #y' + 1_{y<y'})) };
return (||S|| ≤ n);


• Fortunately, $\mathcal{E}(T_x^{K+n}) \setminus S$, where $\|S\| \le n$, contains all
  the possible scenarios denoted by $\mathcal{E}(T_x^{K+i}) \setminus S$, where
  $\|S\| = i$ and $i = 0, \ldots, n-1$.

As a result, we only need to analyze $\mathcal{E}(T_x^{K+n})$, which corresponds
to the (K + n) nearest neighbors of x; other elements,
which are further away from x, can be safely ignored.


E. Algorithm 5 


To compute the lower bound $\mathit{errCntLB}_i^K$, Algorithm 5
checks if all the strategies S satisfying $\mathit{Freq}(\mathcal{E}(T_x^{K+n}) \setminus S) = y$
and $S \subseteq \mathcal{E}(T_x^{K+n})$ must have $\|S\| > n$.

Fig. 4 shows two examples. In each example, the gray dot
is the test input x and the other dots are neighbors of x in
$T_x^{K+n}$. In Fig. 4 (a), #orange = 2 is the number of orange
dots (votes of the correct label). In contrast, #blue = 5 and
#green = 2 are votes of the incorrect labels. By assuming
the lexicographic order blue < green < orange, we define
the indicator functions (tie-breakers) as $\mathbf{1}_{blue<orange} = 1$ and
$\mathbf{1}_{green<orange} = 1$.

Given the removal strategy S = {(blue : 4), (green : 1)},
we know $\|S\| = 5$ and, since n = 4, we have $\|S\| > n$.
Thus, removing up to n = 4 dots cannot make the test input x
correctly classified (as orange). As a result, $\mathit{errCntLB}_i^K$++
is executed to increase the lower bound.

In Fig. 4 (b), however, since #blue = 4, #orange = 3,
$\mathbf{1}_{blue<orange} = 1$, and S = {(blue : 2)}, we have $\|S\| = 2$.
Since $\|S\| \le n$, removing up to n = 4 dots can make
the test input x correctly classified (as orange). As a result,
$\mathit{errCntLB}_i^K$++ is not executed.
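
The check performed by Algorithm 5 can be sketched in Python as follows. For simplicity, the function takes the label counter of the K + n nearest neighbors directly instead of (T, n, K, x, y); this signature, the function name, and the use of string comparison as the lexicographic order are assumptions of the sketch. Only the counts stated for Fig. 4 are used.

```python
from collections import Counter


def cannot_obtain_correct_label(counter, y, n):
    """Sketch of Algorithm 5 on a given label counter of the K + n neighbors.

    Returns True when more than n removals are needed before the correct
    label y can become the most frequent label. Ties go to the
    lexicographically smaller label.
    """
    need = 0
    for other, count in counter.items():
        if other != y and count >= counter[y]:
            # Remove enough votes for `other` so that y beats it; one extra
            # removal is needed when `other` would win a tie against y.
            need += count - counter[y] + (1 if other < y else 0)
    return need > n


fig4a = Counter(blue=5, green=2, orange=2)
fig4b = Counter(blue=4, orange=3)
print(cannot_obtain_correct_label(fig4a, "orange", 4))  # True, since ||S|| = 5 > 4
print(cannot_obtain_correct_label(fig4b, "orange", 4))  # False, since ||S|| = 2 <= 4
```

The two calls reproduce the outcomes of Fig. 4 (a) and Fig. 4 (b), respectively.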


F. Algorithm 6 


To compute the upper bound $\mathit{errCntUB}_i^K$, Algorithm 6
checks if there exists a strategy S that satisfies the condition:
$\mathit{Freq}(\mathcal{E}(T_x^{K+n}) \setminus S) \ne y$, $S \subseteq \mathcal{E}(T_x^{K+n})$, and $\|S\| \le n$.

Fig. 5 shows two examples. In Fig. 5 (a), #orange = 2
is the number of dots with the correct label, and #blue = 5 is the number
of dots with the most frequent wrong label. Thus, $S = \emptyset$ and,
since $\|S\| \le n$, we know that removing up to n = 4 dots can
make the test input misclassified. As a result, $\mathit{errCntUB}_i^K$++
is executed.
Fig. 4. Examples for Algorithm 5 with K = 5, n = 4, and y = orange
being the correct label: (a) S = {(blue : 4), (green : 1)} and the return value
is true; (b) S = {(blue : 2)} and the return value is false.

Fig. 5. Examples for Algorithm 6 with K = 5, n = 4, y = orange as the
correct label, and y' = blue as the most frequent wrong label: (a) S = ∅ and
the return value is true; (b) S = {(orange : 5)} and the return value is false.


In Fig. 5 (b), #orange = 7 is the number of orange dots, and
#blue = 2 is the number of dots with the most frequent
wrong label. Here, we assume $\mathbf{1}_{orange<blue} = 0$. Thus,
S = {(orange : 5)} and, since $\|S\| > n$, we know that removing
up to n = 4 dots cannot make 'blue' (or any other wrong
label) the most frequent label. As a result, $\mathit{errCntUB}_i^K$++
is not executed.
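
A matching Python sketch of the check in Algorithm 6 is given below. As before, it takes the label counter of the K + n nearest neighbors directly, which is a simplification made for this example; the handling of a counter that contains no wrong label is also an assumption, since the pseudocode does not address that case.

```python
from collections import Counter


def may_obtain_wrong_label(counter, y, n):
    """Sketch of Algorithm 6 on a given label counter of the K + n neighbors.

    Returns True when removing at most n votes for the correct label y can
    let the strongest wrong label win. Ties go to the smaller label.
    """
    wrong = [label for label in counter if label != y]
    if not wrong:
        return False  # no competing label among the neighbors (sketch assumption)
    # Strongest wrong label: largest count, then lexicographically smallest.
    y_prime = min(wrong, key=lambda label: (-counter[label], label))
    need = max(0, counter[y] - counter[y_prime] + (1 if y < y_prime else 0))
    return need <= n


fig5a = Counter(orange=2, blue=5)
fig5b = Counter(orange=7, blue=2)
print(may_obtain_wrong_label(fig5a, "orange", 4))  # True, since S is empty
print(may_obtain_wrong_label(fig5b, "orange", 4))  # False, since ||S|| = 5 > 4
```

The two calls reproduce the outcomes of Fig. 5 (a) and Fig. 5 (b), respectively.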


V. ANALYZING THE KNN INFERENCE PHASE 


In this section, we present our method for analyzing the
KNN inference phase, implemented in Algorithm 2 as the
subroutine YSet = abs_KNN_predict(T, n, KSet, x), which
returns a set of output labels for test input x, by assuming
that T contains up-to-n polluted elements.


A. Computing the Classification Labels 


Algorithm 7 shows our method, which first checks whether
the second most frequent label (y') can become the most
frequent one after removing at most n elements. This is
possible only if there exists a strategy S such that (1) it
removes at most n elements labeled y, and (2) after the
removal, y' becomes the most frequent label. This is captured
by the condition $\|S\| = (\#y - \#y' + \mathbf{1}_{y<y'}) \le n$. When the
condition holds, the predicted label is not unique; otherwise,
y is the only possible label for this K value.

We do not attempt to compute more than two labels, as
shown by the return statement in the then-branch, because
they are not needed by the top-level procedure (Algorithm 2),
which only needs to check if |YSet| = 1 for the purpose of
proving n-poisoning robustness.
Algorithm 7: Method abs_KNN_predict(T, n, KSet, x).

YSet = { }
visited = { }
while ∃K ∈ (KSet \ visited) do
    Let E(T_x^{K+n}) be the label counter of T_x^{K+n};
    Let y be the most frequent label of E(T_x^{K+n});
    Let y' be the second most frequent label of E(T_x^{K+n});
    Let removal strategy S = { (y : #y - #y' + 1_{y<y'}) };
    if ||S|| ≤ n then
        YSet = YSet ∪ {y, y'};
        return YSet;
    else
        YSet = YSet ∪ {y};
        K_LB = K - (#y - #y' - n - 1_{y'<y})
        K_UB = K + (#y - #y' - n - 1_{y'<y})
        visited = visited ∪ [K_LB, K_UB]
return YSet;


B. Pruning Redundant K Values 


Inside Algorithm 7, after checking K ∈ KSet, our method
puts K into the visited set to make sure it will never be
checked again for the same test input x. In addition, it
identifies other values in KSet that are guaranteed to be
equivalent to K, and prunes away these redundant values.
Here, equivalent K values are defined as those with the same
inference result for test input x.

To be conservative, we underapproximate the set of
equivalent K values. As a result, these K values can be safely
skipped since the (equivalent) inference result has been
checked. This optimization is implemented using the visited
set in Algorithm 7. The visited set is computed from K and
$\mathcal{E}(T_x^{K+n})$ based on the expression $(\#y - \#y' - n - \mathbf{1}_{y'<y})$
over the removal strategy.


a) The Correctness Guarantee: We now explain why this
pruning technique is safe. The intuition is that, if the most
frequent label $\mathit{Freq}(\mathcal{E}(T_x^{K+n}))$ has significantly
more counts than the second most frequent label, then it may
also be the most frequent label for another value K'. There
are two possibilities:


• If (K' < K), then $T_x^{K'+n}$ has (K - K') fewer elements
  than $T_x^{K+n}$. Since removing elements from the neighbors
  will not increase the label count #y', the only way to
  change the inference result is decreasing the label count
  #y. When $(K - K') \le (\#y - \#y' - n - \mathbf{1}_{y'<y})$,
  decreasing #y will not make any difference. Thus, the
  lower bound of K' is $K - (\#y - \#y' - n - \mathbf{1}_{y'<y})$.

• If (K' > K), then $T_x^{K'+n}$ has (K' - K) more elements
  than $T_x^{K+n}$. Since adding elements to the neighbors will
  not decrease the label count #y, the only way to change
  the inference result is increasing the label count #y'.
  However, as long as $(K' - K) \le (\#y - \#y' - n - \mathbf{1}_{y'<y})$,
  increasing #y' will not make any difference. Thus, the
  upper bound of K' is $K + (\#y - \#y' - n - \mathbf{1}_{y'<y})$.

For example, consider K = 13, n = 2, and $\mathcal{E}(T_x^{15}) = \{(l_1 : 12), (l_2 : 2), (l_3 : 1)\}$.
According to Algorithm 7, $\#y - \#y' - n - \mathbf{1}_{y'<y} = 12 - 2 - 2 - 0 = 8$
and thus we compute the interval
TABLE I
STATISTICS OF THE SUPERVISED LEARNING DATASETS.

Name           # training data (|T|)   # test data (|XSet|)   # output labels (|L|)   # input dimensions (D)
Iris [15]      135                     15                     3                       4
Digits [17]    1,617                   180                    10                      64
HAR [3]        9,784                   515                    6                       561
Letter [16]    18,999                  1,000                  26                      16
MNIST [24]     60,000                  10,000                 10                      36
CIFAR10 [23]   50,000                  10,000                 10                      288


[13 - 8, 13 + 8] = [5, 21]. As a result, candidate K values in
the set {5, 6, 7, ..., 21} can be safely skipped.
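
The following Python sketch combines the label check and the pruning step of Algorithm 7. It is a simplified illustration: `counter_for` is a hypothetical callback that returns the label counter of the K + n nearest neighbors for a given K, candidate K values are visited in list order, and the counts used for K = 25 are made up for the example. The slack is computed as ||S|| - n - 1, which equals #y - #y' - n - 1_{y'<y} for distinct labels.

```python
from collections import Counter


def abs_knn_predict(counter_for, k_values, n):
    """Sketch of Algorithm 7; returns (label set, K values actually checked)."""
    y_set, visited, checked = set(), set(), []
    for k in k_values:
        if k in visited:
            continue                      # pruned: equivalent to an earlier K
        checked.append(k)
        ranked = sorted(counter_for(k).items(), key=lambda kv: (-kv[1], kv[0]))
        y, cnt_y = ranked[0]
        if len(ranked) > 1:
            y2, cnt_y2 = ranked[1]
            tie = 1 if y < y2 else 0      # 1_{y<y'}: y wins a tie against y'
        else:
            y2, cnt_y2, tie = None, 0, 0  # single label: assume a rival wins ties
        need = cnt_y - cnt_y2 + tie       # ||S||: removals of y needed to flip
        if y2 is not None and need <= n:
            y_set.update((y, y2))
            return y_set, checked
        y_set.add(y)
        slack = need - n - 1              # equals #y - #y' - n - 1_{y'<y}
        visited.update(range(k - slack, k + slack + 1))
    return y_set, checked


counters = {13: Counter(l1=12, l2=2, l3=1),   # example from the text (K + n = 15)
            25: Counter(l1=20, l2=5, l3=2)}   # made-up counts for illustration
labels, checked = abs_knn_predict(counters.__getitem__, [13, 5, 21, 25], n=2)
print(labels, checked)  # {'l1'} [13, 25]
```

After checking K = 13, the interval [5, 21] is marked as visited, so K = 5 and K = 21 are skipped without ever requesting their label counters.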


VI. EXPERIMENTS 


We have implemented our method in Python using the
machine learning library scikit-learn 0.24.2, and evaluated it
on two sets of supervised learning datasets. Table I shows the
statistics, including the name, size of the training set, size of
the test set, number of output class labels, and dimension of the
input feature space. For MNIST and CIFAR10, in particular,
the features were extracted using the standard histogram of
oriented gradients (HOG) method [10].

The first set of datasets consists of Iris and Digits, two 
small datasets for which even the baseline method as shown 
in Algorithm 1 can finish and thus obtain the ground truth. We 
use the ground truth to evaluate the accuracy of our method. 
The second set of datasets consists of HAR, Letter, MNIST,
and CIFAR10, which are larger datasets used to evaluate the
efficiency of our method.

For comparison purposes, we also implemented the baseline
method in Algorithm 1, and the method of Jia et al. [21], which
represents the state of the art. Experiments were conducted
on polluted training sets obtained by randomly inserting at most
n input- and output-mutated samples into the original datasets.
Since the same polluted training sets are used to compare all
verification methods, and since the verification methods are
deterministic, there is no need to run the experiments multiple
times and then compute the average. Instead, we run each
verification method on each polluted training set once. All
experiments were conducted on a computer with a 2 GHz
Quad-Core Intel Core i5 CPU and 16 GB of memory.


A. Results on the Small Datasets 


We first compared our method with the baseline on the 
small datasets where the baseline method could actually finish. 
This is important because the baseline method does not rely 
on over-approximation, and thus can obtain the ground truth. 
Here, the ground truth means which of the test data have 
inference results that are actually robust against n-poisoning 
attacks. By comparing the ground truth with our result, we 
were able to evaluate the accuracy of our method. 

Table II shows the results. Column 1 shows the name of 
the dataset and the polluted number n. Columns 2-3 show 
the result of the baseline method, consisting of the number of 
verified test data and the time taken. Similarly, Columns 4-5 


TABLE II
RESULTS OF OUR METHOD AND THE BASELINE METHOD ON THE SMALL
DATASETS WITH THE MAXIMAL POLLUTED NUMBER n=1, 2, AND 3.

Name           Baseline                 New Method               Accuracy
               # robust    time (s)     # robust    time (s)
Iris (n=1)     15/15       60           14/15       1            93.3%
Iris (n=2)     14/15       4,770        13/15       1            92.9%
Iris (n=3)     -           >9,999       11/15       1            -
Digits (n=1)   179/180     8,032        172/180     1            96.1%
Digits (n=2)   -           >9,999       170/180     1            -
Digits (n=3)   -           >9,999       165/180     1            -


show the result of our method. Column 6 shows the accuracy 
of our method in percentage. 

The results indicate that, for test data that are indeed robust
according to the ground truth, our method can successfully
verify most of them. In Iris (n=2), for example, Column 2
shows that 14 of the 15 test data are robust according to the
baseline method, and Column 4 shows that 13 of the 15
test data are verified by our method. Since every verified case is
truly robust, our method verifies 13 of the 14 robust cases and is
therefore 92.9% accurate.
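
The accuracy figures in Table II can be reproduced with a one-line computation. The definition used below, verified cases divided by truly robust cases, is inferred from the numbers in the table rather than stated explicitly in the text, and the function name is hypothetical.

```python
def accuracy(verified, truly_robust):
    """Share of truly robust test inputs that the verifier manages to certify."""
    return verified / truly_robust


print(f"{accuracy(14, 15):.1%}")    # 93.3% for Iris (n=1)
print(f"{accuracy(13, 14):.1%}")    # 92.9% for Iris (n=2)
print(f"{accuracy(172, 179):.1%}")  # 96.1% for Digits (n=1)
```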

Our method is much faster than the baseline. For Digits
(n=1), in particular, our method took only 1 second to verify
172 out of the 180 test data as being robust while the
baseline method took 8,032 seconds. As the polluted number
n increases, the baseline method runs out of time even for
these small datasets. As a result, we no longer have the
ground truth needed to directly measure the accuracy of our
method. Nevertheless, since all cases verified by our method
are guaranteed to be robust, the number of verified test data in
Column 4 of Table II serves as a proxy: it decreases slowly
as n increases, indicating that the accuracy of our method
remains high.


B. Results on the Large Datasets 


We also evaluated our method on the large datasets. Table III 
summarizes the results on these large datasets as well as the 
two small datasets but with larger polluted numbers (n). Since 
these verification problems are out of the reach of the baseline 
method, we no longer have the ground truth. Thus, instead of 
measuring the accuracy, we measure the percentage of test 
data that we can verify, shown in Column 3 of Table III. 

For example, in Iris, n = 1~5 (4%) in Column 2 means
that these experiments were conducted for each poisoning
number n = 1, 2, ..., 5. Since the training dataset has 135
elements, n = 5 means about 4% (or 5/135) of these training data
may have been polluted. In Column 3, 93.3% is the percentage
of verified test data for n = 1, while 73.3% is the percentage
of verified test data for n = 5. Except for Iris, which has a
small number of training data, we set the poisoning number
n to be at most 1% of the training dataset.

Overall, our method remains fast as the sizes of T, XSet,
and n increase. For MNIST, in particular, our method finished
analyzing both 10-fold cross validation and KNN inference in
26 minutes, for all of the 60,000 data elements in the training
set and 10,000 data elements in the test set. In contrast, the
TABLE III
RESULTS OF OUR METHOD ON LARGE DATASETS, AND ON SMALL
DATASETS BUT WITH LARGER POLLUTED NUMBERS.

Name      Polluted Number (n)   Verified Percentage (# robust/|XSet|)   Verification Time (s)
Iris      1~5 (4%)              93.3%~73.3%                             1~
Digits    1~16 (1%)             95.6%~80.6%                             1~2
HAR       1~98 (1%)             99.4%~71.7%                             85~93
Letter    1~190 (1%)            94.0%~5.5%                              33~43
MNIST     1~600 (1%)            99.9%~53.5%                             888~994
CIFAR10   1~500 (1%)            99.2%~2.8%                              1,453~1,559


baseline method failed to verify any of the test data within the 
9999-second time limit. 

Without the ground truth, the verified percentage provides
a lower bound on the number of test data that remain robust
against data-poisoning attacks. When n=1, the verified
percentage in Column 3 is high for all datasets. As the polluted
number n increases to 1% of the entire training set T, the
verified percentage decreases. Furthermore, the decrease is
more significant for some datasets than for others. For
example, in MNIST, at least 53.5% of the test data remain
robust under 1% (or 600) poisoning attacks. In CIFAR10,
however, only 2.8% of the test data remain robust under
1% (or 500) poisoning attacks. Thus, the relationship between
the verified percentage and the polluted number largely reflects
the unique characteristics of these datasets. By this, we
mean that if one dataset has more truly-non-robust cases than
another dataset, then the verifier will report more
cannot-be-verified cases.

The accuracy is low for the Letter and CIFAR10
datasets because they have larger attack surfaces in the
extracted feature space: elements from the same class are not
sufficiently concentrated in one area, and the neighbors include
many elements from other classes. Thus, small changes to the
neighbors can lead to significant changes of the class label.
While we believe that the accuracy (measured by the verified
percentage) may improve if a better feature extractor is used
(to improve the quality of extracted features), this is outside the
scope of the verification task.


C. Compared with the Existing Method 


While our method is the only one that can verify the 
entire KNN algorithm, there are existing methods that can 
verify part of the KNN algorithm. The most recent method 
proposed by Jia et al. [21], in particular, aims to verify the 
KNN inference step with a given K value; thus, it can be 
regarded as functionally equivalent to the subroutine of our 
method as presented in Algorithm 7. However, our method is 
significantly more accurate due to its tighter approximation. To 
experimentally demonstrate the advantage of our method, we 
used their method to replace Algorithm 7 in our own method 
before conducting the experimental comparison. Since an 
open-source implementation of their method is not available, 
we have implemented it ourselves. 

Fig. 6 shows the results, where blue lines represent our 
method and orange lines represent their method [21]. Overall, 
Fig. 6. Comparing our method (blue) with Jia et al. [21] (orange): the x-axis
is the polluted number n and the y-axis is the percentage of verified test data.


the verified percentage obtained by our method is significantly 
higher, due to its tighter approximations during the KNN 
inference phase. For all datasets, the verified percentage ob- 
tained by their method drops more quickly than the verified 
percentage obtained by our method. For Iris, in particular, their 
method cannot verify any of the test data, while our method 
can verify more than 70% of them as being robust. 


VII. RELATED WORK 


There is a large body of work on verifying the (local) 
robustness of machine learning algorithms using formal meth- 
ods. However, unlike most prior works which focus on adver- 
sarial examples in the context of deep neural networks, this 
work focuses on poisoned datasets for KNN. Unlike neural 
networks, for which scalability of the verification method 
typically depends on the network size but not the size of the 
training data, for KNN, scalability depends on the size of the 
training data and the number of poisoned elements. 

In the context of robustness verification for KNN, our
method can soundly verify n-poisoning robustness
of the entire KNN algorithm, while existing methods
such as Jia et al. [21] and others [39], [20], [40] are either
restricted to a small part of what constitutes a state-of-the-art
KNN system or primarily theoretical (and thus not scalable).
Since we follow the definition of n-poisoning robustness
in Drews et al. [12] instead of Jia et al. [21], our method
only handles the removal of elements from already-polluted
datasets, but not the addition or modification of elements for clean
datasets. Extending our method to handle such cases will be
future work.

In addition to this line of research, there is a large body of 
work on adversarial data poisoning in general. 


Data Poisoning in General. KNN is not the only type of
machine learning technique found vulnerable to adversarial
data poisoning; prior work shows that regression models [29], 
support vector machines (SVM) [6], [43], [42], clustering 
algorithms [7], and neural networks [34], [37], [11], [45] 
are also vulnerable. Unlike our work, this line of research 
is primarily concerned with showing the security threats and 
identifying the poisoning sets, which is often formulated as a 
constrained optimization problem. 


Mitigating Data Poisoning. Techniques have been proposed
to mitigate data poisoning for various machine learning al- 
gorithms [35], [38], [19], [13], [5]. There are also tech- 
niques [22], [28] for assessing the effectiveness of mitigation 
techniques such as data sanitization [22] and differentially- 
private countermeasures [28]. More recently, Bahri et al. [4] 
propose a method that leverages both KNN and a deep neural 
network to remove mislabeled data. 


Certifying the Defenses Probabilistically. There are techniques
for certifying the defenses [32], [25] such that accuracy
is guaranteed probabilistically. For example, Rosenfeld et 
al. [32] leverage randomized smoothing to guarantee test-time 
robustness to adversarial manipulation with high probability. 
Levine et al. [25] certify robustness of a defense by deriving a 
lower bound of classification error, which relies on their deep 
partition aggregation (DPA) learning and is not applicable to 
typical learning approaches. 


Leveraging KNN for Attacks or Defenses. Orthogonal to
our work, there are techniques that leverage KNN to generate 
attacks or provide defenses for other machine learning models. 
For example, Li et al. [26] present a data-poisoning attack that 
leverages KNN to maximize the effectiveness of malicious 
behavior while mimicking the user’s benign behavior. Peri et 
al. [31] use KNN to defend against adversarial-input-based
attacks, although their technique focuses only on tweaking the
test input during the inference phase.


VIII. CONCLUSIONS 


We have presented the first method for soundly verifying 
n-poisoning robustness for the entire KNN algorithm that 
includes both the learning (K parameter tuning) and the 
inference (classification) phases. It relies on sound overap- 
proximations to exhaustively and yet efficiently cover the 
astronomically large number of possible adversarial scenarios. 
We have demonstrated the accuracy and efficiency of our 
method, and its advantages over a state-of-the-art method, 
through experimental evaluation using both small and large 
supervised-learning datasets. Besides KNN, our method for
soundly over-approximating p-fold cross validation may be
used to analyze similar cross-validation steps frequently used
in other modern machine learning systems.


Abstract—With the increasing application of deep learning in
mission-critical systems, there is a growing need to obtain formal 
guarantees about the behaviors of neural networks. Indeed, many 
approaches for verifying neural networks have been recently 
proposed, but these generally struggle with limited scalability or 
insufficient accuracy. A key component in many state-of-the-art 
verification schemes is computing lower and upper bounds on the 
values that neurons in the network can obtain for a specific input 
domain — and the tighter these bounds, the more likely the ver- 
ification is to succeed. Many common algorithms for computing 
these bounds are variations of the symbolic-bound propagation 
method; and among these, approaches that utilize a process 
called back-substitution are particularly successful. In this paper, 
we present an approach for making back-substitution produce 
tighter bounds. To achieve this, we formulate and then minimize 
the imprecision errors incurred during back-substitution. Our 
technique is general, in the sense that it can be integrated into 
numerous existing symbolic-bound propagation techniques, with 
only minor modifications. We implement our approach as a proof- 
of-concept tool, and present favorable results compared to state- 
of-the-art verifiers that perform back-substitution. 


I. INTRODUCTION 


Deep neural networks (DNNs) are dramatically changing 
the way modern software is written. In many domains, such as 
image recognition [43], game playing [42], protein folding [2] 
and autonomous vehicle control [12], [30], state-of-the-art 
solutions involve deep neural networks — which are artifacts 
learned automatically from a finite set of examples, and which 
often outperform carefully handcrafted software. 

Along with their impressive success, DNNs present a sig- 
nificant new challenge when it comes to quality assurance. 
Whereas many best practices exist for writing, testing, verify- 
ing and maintaining hand-crafted code, DNNs are automati- 
cally generated, and are mostly opaque to humans [24], [25]. 
Consequently, it is difficult for human engineers to reason 
about them and ensure their correctness and safety — as most 
existing approaches are ill-suited for this task. This challenge 
is becoming a significant concern, with various faults being 
observed in modern DNNs [5]. One notable example is that
of adversarial perturbations: small perturbations that, when
added to inputs that are correctly classified by the DNN, result
in severe errors [20], [48]. This issue, and others, call into
question the safety, security and interpretability of DNNs, and 
could hinder their adoption by various stakeholders. 

In order to mitigate this challenge, the formal methods community has taken an interest in DNN verification. In the past few years, a plethora of approaches have been proposed for tackling the DNN verification problem, in which we are given a DNN and a condition about its inputs and outputs, and seek to either find an input assignment to the DNN that satisfies this condition, or prove that it is not satisfiable [1], [8], [10], [14], [21], [27], [29], [31], [33], [39], [51], [57].
The usefulness of DNN verification has been demonstrated 
in several settings and domains [21], [27], [31], [47], but 
most existing approaches still struggle with various limitations, 
specifically relating to scalability. 

A key technical challenge in verifying neural networks is to 
reason about activation functions, which are non-linear (e.g., 
piece-wise linear) transformations applied to the output of each 
layer in the neural network. Precisely reasoning about such 
non-linear behaviors requires a case-by-case analysis of the 
activation phase of each activation function, which quickly 
becomes infeasible as the number of non-linear activations 
increases. Instead, before performing such a search procedure, 
state-of-the-art solvers typically first consider linear abstrac- 
tions of activation functions, and use these abstractions to 
over-approximate the values that the activation functions can 
take in the neural network. Often, these over-approximations 
significantly curtail the search space that later needs to be 
explored, and expedite the verification procedure as a whole. 

A key operation that is repeatedly invoked in this compu- 
tation of over-approximations is called back-substitution [45], 
where the goal is to compute, for each neuron in the DNN, 
lower and upper bounds on the values it can take with respect 
to the input region of interest. This is done by first express- 
ing the lower and upper bounds of a neuron symbolically 
as a function of neurons from previous layers, and then 
concretizing these symbolic bounds with the known bounds 
of neurons in those previous layers. Such a technique is 
essential in state-of-the-art solvers (e.g., [32], [45], [54]) and 
is often able to obtain sufficiently tight bounds for proving 
the properties with respect to small input regions. However, it 
tends to significantly lose precision when the input region (i.e., 
perturbation radius) grows, preventing one from efficiently 
verifying more challenging problems. 

In this work, we seek to improve the precision and scala- 
bility of DNN verification techniques, by reducing the over- 
approximation error in the back-substitution process. Our key 
insight is that, as part of the symbolic-bound propagation, one 
can measure the error accumulated by the over-approximations 
used in back-substitution. Often, the currently computed bound can then be significantly improved by “pushing” it towards the true function, in a way that maintains its validity. For example, suppose that we upper-bound a function $f$ with a function $g$, i.e., $\forall x.\ g(x) \ge f(x)$. If we discover that the minimal approximation error is 5, i.e., $\min_x \{g(x) - f(x)\} = 5$, then $g(x) - 5$ can be used as a better upper bound for $f$ than the original $g$. By integrating this simple principle into the
back-substitution process, we show that we can obtain much 
tighter bounds, which eventually translates to the ability to 
verify more difficult properties. 

We propose here a verification approach, called DeepMIP,
that uses symbolic-bound tightening enhanced with our
error-optimization method. At each iteration of the back- 
substitution, DeepMIP invokes an external MIP solver [26] 
to compute bounds on the error of the current approximation, 
and then uses these bounds to improve that approximation. 
As we show, this leads to an improved ability to solve 
verification benchmarks when compared to state-of-the-art, 
symbolic-bound tightening techniques. We discuss the differ- 
ent advantages of the approach, as well as the extra overhead 
that it incurs, and various enhancements that could be used to 
expedite it further. 

The rest of the paper is organized as follows. We begin by 
presenting the necessary background on DNNs, DNN verifica- 
tion, and on symbolic-bound propagation in Sec. II. Next, in 
Sec. III we show how one can express the approximation error 
incurred as part of the back-substitution process. In Sec. IV we 
present the DeepMIP algorithm, followed by its evaluation in 
Sec. V. Related work is discussed in Sec. VI, and we conclude 
in Sec. VII. 


II. BACKGROUND 


Neural networks. A fully-connected feed-forward neural network with $k+1$ layers is a function $N : \mathbb{R}^m \to \mathbb{R}^n$. Given an input $x \in \mathbb{R}^m$, we use $N_i(x)$ to denote the values of neurons in the $i$-th layer ($0 \le i \le k$). The output of the neural network $N(x)$ is defined as $N_k(x)$, which we refer to as the output layer. More concretely, for $1 \le i \le k$,

$$N_i(x) = \sigma(W^{i-1} N_{i-1}(x) + b^{i-1})$$

where $W^{i-1}$ is a weight matrix, $b^{i-1}$ is a bias vector, $\sigma$ is an activation function (in this paper, we focus on the ReLU activation function, defined as $\mathrm{ReLU}(x) = \max\{0, x\}$, and use $\sigma$ and ReLU interchangeably unless otherwise specified) and $N_0(x) = x$. We refer to $N_0$ as the input layer. Typically, non-linear activations are not applied to the output layer. Thus, when $i = k$, we let $\sigma$ be the identity function. We note that our techniques are general, and apply to other activation functions (MaxPool, LeakyReLU) and architectures (e.g., convolutional, residual).
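
As a concrete illustration of the definition above, the following sketch evaluates a small fully-connected ReLU network in NumPy. The function name `forward` and the 3-3-3-1 weights are illustrative assumptions (they are not taken from any tool described in the paper), and biases are set to zero for brevity.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(weights, biases, x):
    """Evaluate N(x): apply ReLU after every layer except the output layer."""
    values = np.asarray(x, dtype=float)
    last = len(weights) - 1
    for i, (W, b) in enumerate(zip(weights, biases)):
        values = W @ values + b          # affine part: W^{i-1} N_{i-1}(x) + b^{i-1}
        if i != last:
            values = relu(values)        # the activation is the identity on the output layer
    return values

# Hypothetical 3-3-3-1 network with zero biases (assumed weights for illustration).
W0 = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
W1 = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 1.0], [-1.0, 1.0, -1.0]])
W2 = np.array([[1.0, 1.0, 1.0]])
weights = [W0, W1, W2]
biases = [np.zeros(3), np.zeros(3), np.zeros(1)]

output = forward(weights, biases, [1.0, -1.0, 0.0])
print(output)
```

Evaluating single inputs like this can only falsify a property; proving one requires reasoning about all inputs in the domain, which is what the bounds discussed next are for.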


Neural network verification. The neural network verification problem [31], [39] is defined as follows: given an input domain $D_i \subseteq \mathbb{R}^m$ and an output domain $D_o \subseteq \mathbb{R}^n$, the goal is to determine whether $\forall x \in D_i,\ N(x) \in D_o$. If the answer is affirmative, we say that the verification property pair $(D_i, D_o)$ holds. In this paper, we assume that the neural network has a single output neuron and that the verification problem can be reduced to the problem of finding the minimum and/or maximum values for that single output neuron:

$$\min_{x \in D_i}(N(x)) \qquad \max_{x \in D_i}(N(x)) \qquad (1)$$

For example, if $D_o$ is the interval $[-2, 7]$ and we discover that $\min_{x \in D_i}(N(x)) = 1$ and $\max_{x \in D_i}(N(x)) = 3$, then we are guaranteed that the property holds. We will focus on solving just the maximization problem, although the method that we present next can just as readily be applied towards the minimization problem.

A straightforward way to solve the optimization problem
in Eq. 1 is to encode the neural network as a mixed integer
programming (MIP) instance [11], [31], [49], and then solve 
the problem using a MIP solver, which often employs a 
branch-and-bound procedure. While this approach has proven 
effective at verifying small DNNs, it faces a scalability 
barrier when it comes to larger networks. Therefore, before 
invoking the branch-and-bound procedure, existing solvers 
typically first seek to prove the property with abstraction-based 
techniques (symbolic-bound propagation), which have more 
tractable runtime. 


Symbolic-bound propagation. Symbolic-bound propaga- 
tion [21], [51] is a method of obtaining bounds on the concrete 
values a neuron may obtain. When applied to a network’s 
output neuron, it enables us to obtain an approximate solution 
to the optimization problems from Eq. 1, which may be 
sufficient to determine that the property holds. For example, 
continuing the example from before, if we are unable to
exactly compute that $\max_{x \in D_i}(N(x)) = 3$ but can determine
that $\max_{x \in D_i}(N(x)) \le 5$, this is enough for concluding that
the property in question holds. The idea underlying symbolic-
bound propagation is to start from the bounds for the input 
layer provided in D;, and then propagate them, layer-by- 
layer, up to the output layer. It has been observed that while 
affine transformations allow us to precisely propagate bounds 
from a layer to its successor, activation functions introduce 
inaccuracies [45]. 

Before formally defining symbolic bound propagation, we start with an intuitive example using the network in Fig. 1. Let $x^i$ denote the pre-activation values of the neurons in layer $i$, and let $y^i = \sigma(x^i)$ denote their post-activation values; similarly, let $x^i_j$ and $y^i_j = \sigma(x^i_j)$ denote the pre- and post-activation values of neuron $j$ in layer $i$; and let $l^i_j, u^i_j$ denote the concrete (scalar) lower and upper bounds for $x^i_j$, i.e., $l^i_j \le x^i_j \le u^i_j$ when the DNN is evaluated on any input from $D_i$. Assume that $D_i$ is the following box domain:

$$x^0_0 \in [-1, 1], \quad x^0_1 \in [-1, 1], \quad x^0_2 \in [-1, 1]$$

and that we wish to compute bounds for the single output neuron, $x^3_0$.

Fig. 1: A neural network.

We begin by propagating the bounds through the first affine layer. According to the network’s weights and biases, we get:

$$x^1_0 = x^0_0 + x^0_1, \quad x^1_1 = x^0_0 - x^0_1, \quad x^1_2 = x^0_2$$

These equations allow us to compute concrete lower and upper bounds for each of these neurons, by substituting the input neurons ($x^0_0, x^0_1, x^0_2$) with their corresponding concrete bounds (according to the sign of their coefficients). Using this process, we obtain:

$$x^1_0 \in [-2, 2], \quad x^1_1 \in [-2, 2], \quad x^1_2 \in [-1, 1]$$

This propagation, often referred to as interval arithmetic [15], is precise for individual neurons: indeed, $x^1_0$, $x^1_1$ and $x^1_2$ can each take on any value in their respective computed ranges. However, much important information is lost when using just interval arithmetic: for example, it is impossible for $x^1_0$ and $x^1_1$ to simultaneously be assigned 2. As we will later see,
symbolic-bound propagation addresses this issue by capturing 
some of the dependencies between neurons, and using these 
dependencies in producing tighter bounds. 
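
The interval-arithmetic step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the helper names are hypothetical, and the first-layer weight matrix `W0` is an assumption chosen to be consistent with the ranges reported in the text for an input box of $[-1, 1]$ per coordinate.

```python
import numpy as np

def affine_interval(W, b, lower, upper):
    """Propagate a box through x -> W x + b, picking bounds by coefficient sign."""
    W_pos = np.maximum(W, 0.0)
    W_neg = np.minimum(W, 0.0)
    new_lower = W_pos @ lower + W_neg @ upper + b
    new_upper = W_pos @ upper + W_neg @ lower + b
    return new_lower, new_upper

def relu_interval(lower, upper):
    """The output range of a ReLU is the non-negative part of its input range."""
    return np.maximum(lower, 0.0), np.maximum(upper, 0.0)

# Assumed first-layer weights (zero biases) and input box [-1, 1]^3.
W0 = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
x_lower, x_upper = -np.ones(3), np.ones(3)

lo1, hi1 = affine_interval(W0, np.zeros(3), x_lower, x_upper)
y_lo, y_hi = relu_interval(lo1, hi1)
print(lo1, hi1)
print(y_lo, y_hi)
```

Each neuron's range is computed independently here, which is exactly why the correlation between the first two neurons (they cannot both equal 2) is lost.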

For now, we continue propagating our computed bounds to neurons $y^1_0$, $y^1_1$ and $y^1_2$. The output range of a ReLU is the non-negative part of its input range, which yields:

$$y^1_0 \in [0, 2], \quad y^1_1 \in [0, 2], \quad y^1_2 \in [0, 1]$$

and the next, affine layer is again handled using interval arithmetic. Using the expressions

$$x^2_0 = y^1_0 + y^1_1, \quad x^2_1 = -y^1_0 + y^1_1 + y^1_2, \quad x^2_2 = -y^1_0 + y^1_1 - y^1_2$$

and substituting each $y^1_j$ with the appropriate bound, we obtain:

$$x^2_0 \in [0, 4], \quad x^2_1 \in [-2, 4], \quad x^2_2 \in [-4, 2]$$

Unfortunately, as we soon show, the bounds computed for $x^2_0$, $x^2_1$, $x^2_2$ are not tight. A better approach is to compute symbolic bounds, as opposed to concrete ones, in a way that lets us carry additional information about the dependencies between neurons. In symbolic-bound propagation, we seek to express the upper and lower bounds of each neuron as a linear combination of neurons from earlier layers, using a process known as back-substitution. The main difficulty is to propagate these bounds across ReLU layers, which are not convex; and this is performed by using a triangle relaxation of the ReLU function, illustrated in Fig. 2. Assume $x \in [l, u]$; then, using this relaxation, we can deduce the following bounds:

$$0 \le \sigma(x) \le 0 \quad \text{if } u \le 0$$
$$x \le \sigma(x) \le x \quad \text{if } l \ge 0$$
$$\alpha x \le \sigma(x) \le \frac{u}{u-l}(x - l) \quad \text{otherwise, for any } 0 \le \alpha \le 1$$


Different symbolic bound propagation methods use different
heuristics for choosing $\alpha$ [45], [54]; but this is beyond our
scope here, and our proposed technique is compatible with
any such heuristic. For our running example, we arbitrarily
choose the values of $\alpha$; and for our implementation, we use
an existing heuristic [54].
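
The three cases of the triangle relaxation can be written down directly. The sketch below is illustrative only: the function name `triangle_relaxation` and its return convention (slope, intercept pairs) are assumptions, and the soundness check simply samples the interval rather than proving anything.

```python
import numpy as np

def triangle_relaxation(l, u, alpha=0.0):
    """Return ((a_lo, c_lo), (a_up, c_up)) such that
    a_lo * x + c_lo <= relu(x) <= a_up * x + c_up for all x in [l, u]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if u <= 0:                      # ReLU is inactive on the whole interval
        return (0.0, 0.0), (0.0, 0.0)
    if l >= 0:                      # ReLU is the identity on the whole interval
        return (1.0, 0.0), (1.0, 0.0)
    slope = u / (u - l)             # unstable case: chord from (l, 0) to (u, u)
    return (alpha, 0.0), (slope, -slope * l)

# Example on x in [-1, 1] with alpha = 0.
(lo_a, lo_c), (up_a, up_c) = triangle_relaxation(-1.0, 1.0, alpha=0.0)
xs = np.linspace(-1.0, 1.0, 201)
relu_xs = np.maximum(xs, 0.0)
sound = bool(np.all(lo_a * xs + lo_c <= relu_xs + 1e-12)
             and np.all(relu_xs <= up_a * xs + up_c + 1e-12))
print((lo_a, lo_c), (up_a, up_c), sound)
```

Any $\alpha$ in $[0, 1]$ gives a valid lower bound; the choice only affects how tight the resulting symbolic bounds are.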


Fig. 2: A triangle relaxation of a ReLU function for $x \in [-1, 1]$. The solid lines correspond to the exact ReLU function, and the dotted lines represent the relaxed lower and upper bounds, for different values of $\alpha$.


Using this relaxation, we show how to compute symbolic bounds that yield tighter bounds for the $x^2$ neurons. First observe neuron $x^2_0$, given as $x^2_0 = y^1_0 + y^1_1 = \sigma(x^1_0) + \sigma(x^1_1)$. To obtain its lower bound we first substitute both $y^1_0 = \sigma(x^1_0)$ and $y^1_1 = \sigma(x^1_1)$ with their corresponding triangle relaxation lower bounds, with the choice of $\alpha = 0$ for both (we note that it is possible to choose different $\alpha$ values for different variables). For the upper bound, we use the linear upper bound from the triangle relaxation. By using the bounds we already know for nodes in previous layers, we get that:

$$x^2_0 \ge 0 \cdot x^1_0 + 0 \cdot x^1_1 = 0$$
$$x^2_0 \le \tfrac{1}{2}(x^1_0 + 2) + \tfrac{1}{2}(x^1_1 + 2) = \tfrac{1}{2}(x^1_0 + x^1_1) + 2 = \tfrac{1}{2}\big((x^0_0 + x^0_1) + (x^0_0 - x^0_1)\big) + 2 = x^0_0 + 2 \le 3$$

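which indeed produces a tighter upper bound than the one obtained for $x^2_0$ using interval propagation. Similarly, we get that for $x^2_1$:

$$x^2_1 \ge -\tfrac{1}{2}(x^1_0 + 2) + 0 \cdot x^1_1 + 0 \cdot x^1_2 \ge -2$$
$$x^2_1 \le -0 \cdot x^1_0 + \tfrac{1}{2}(x^1_1 + 2) + \tfrac{1}{2}(x^1_2 + 1) = \tfrac{1}{2}(x^1_1 + x^1_2) + 1.5 = \tfrac{1}{2}(x^0_0 - x^0_1 + x^0_2) + 1.5 \le 3$$

and for $x^2_2$:

$$x^2_2 \ge -\tfrac{1}{2}(x^1_0 + 2) + 0 \cdot x^1_1 - \tfrac{1}{2}(x^1_2 + 1) = -\tfrac{1}{2}(x^1_0 + x^1_2) - 1.5 = -\tfrac{1}{2}(x^0_0 + x^0_1 + x^0_2) - 1.5 \ge -3$$
$$x^2_2 \le -0 \cdot x^1_0 + \tfrac{1}{2}(x^1_1 + 2) - 0 \cdot x^1_2 = \tfrac{1}{2} x^1_1 + 1 = \tfrac{1}{2}(x^0_0 - x^0_1) + 1 \le 2$$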


We have thus obtained the following bounds:

$$x^2_0 \in [0, 3], \quad x^2_1 \in [-2, 3], \quad x^2_2 \in [-3, 2]$$

We note that while these bounds are tighter than the ones produced by interval propagation, and are in fact optimal for $x^2_1$ and $x^2_2$, this is not the case for $x^2_0$ (the optimal bounds are displayed in square brackets in Fig. 1). The reason for this sub-optimality is discussed in Sec. III.

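We continue to propagate our bounds through the next layer, obtaining:

$$y^2_0 \in [0, 3], \quad y^2_1 \in [0, 3], \quad y^2_2 \in [0, 2]$$

and finally reach:

$$\begin{aligned}
x^3_0 &= y^2_0 + y^2_1 + y^2_2 = \sigma(x^2_0) + \sigma(x^2_1) + \sigma(x^2_2) \\
&\le x^2_0 + \tfrac{3}{5}(x^2_1 + 2) + \tfrac{2}{5}(x^2_2 + 3) \\
&= 2y^1_1 + \tfrac{1}{5}y^1_2 + \tfrac{12}{5} = 2\sigma(x^1_1) + \tfrac{1}{5}\sigma(x^1_2) + \tfrac{12}{5} \\
&\le 2 \cdot \tfrac{1}{2}(x^1_1 + 2) + \tfrac{1}{5} \cdot \tfrac{1}{2}(x^1_2 + 1) + \tfrac{12}{5} \\
&= x^0_0 - x^0_1 + \tfrac{1}{10}x^0_2 + 4.5 \le 6.6
\end{aligned}$$
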
More generally, the back-substitution process for upper-bounding a neuron $x^k_i$ (assuming we already have valid bounds for all neurons in earlier layers) is iteratively defined as:

$$\begin{aligned}
\max(x^k_i) &= \max(W^{k-1}_i \sigma(x^{k-1})) \\
&\le \max(W^{k-1}_i R^{k-2}_U x^{k-1}) \\
&= \max(W^{k-1}_i R^{k-2}_U W^{k-2} \sigma(x^{k-2})) \\
&\le \max(W^{k-1}_i R^{k-2}_U W^{k-2} R^{k-3}_U x^{k-2}) \\
&= \ldots \le \max\Big(W^{k-1}_i \prod_{j=k-2}^{0} (R^j_U W^j)\, x^0\Big)
\end{aligned}$$

(Biases and constants are handled similarly, and are omitted for clarity.) At each step, we can replace the variables of $x^t$ by their respective concrete bounds $[l^t_i, u^t_i]$, in an interval-arithmetic fashion, to obtain a valid concrete upper bound for the value of $\max(x^k_i)$. We refer to this operation as concretization. We call the matrices $R_L$, $R_U$ the respective lower- and upper-bound relaxation matrices [54]. These matrices apply the appropriate triangle relaxation to each ReLU, allowing us to replace it with a linear bound, and are defined using the current symbolic bounds for each ReLU as well as the weight matrix of the layer that precedes it. The two matrices are defined such that $\forall x \in D_i$:

$$w_i R_L x + c_L \le w_i \sigma(x) \le w_i R_U x + c_U$$

where $c_L$ and $c_U$ are scalar constants; and $w_i$ is a row vector containing the coefficients of each $\sigma(x_i)$, resulting in linear bounds for the sum of ReLUs. A precise definition of these matrices appears in Sec. A of the Appendix; and a similar procedure can be applied for lower-bounding $x^k_i$.

At first glance, the iterative back-substitution process may 
seem counter productive; indeed, in each iteration where we 
move to an earlier layer of the network, we use a less- 
than-equals transition, which seems to indicate that the upper 
bound that we will eventually reach is more loose than the 
present bound. This, however, is not so; and the reason is 
the concretization process. When we concretize the bounds in 
some later iteration, it is possible that the known bounds for 
the variables in that layer of the network will lead to a tighter 
upper bound than the one that can be derived presently. More 
generally, this process can be regarded as a trade-off between 
computing looser expressions for the bound, but being able 
to concretize them over more exact domains — which could 
result in tighter bounds [45]. 
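
To make the iteration concrete, here is a minimal sketch of upper-bound back-substitution on the running example. It is an illustration under stated assumptions, not the authors' implementation: the helper names are hypothetical, the weight matrices are reconstructed from the example's expressions with zero biases, the input box is assumed to be $[-1, 1]^3$, and the pre-activation bounds for layers 1 and 2 are taken as given from the text rather than recomputed.

```python
import numpy as np

def relax_upper(coeffs, const, lower, upper, alpha=0.0):
    """Upper-bound sum_j coeffs[j] * relu(x_j) + const by a linear function of x."""
    new_coeffs = np.zeros_like(coeffs)
    for j, (c, l, u) in enumerate(zip(coeffs, lower, upper)):
        if u <= 0:
            continue                       # relu(x_j) is identically 0
        if l >= 0:
            new_coeffs[j] = c              # relu(x_j) equals x_j
        elif c >= 0:
            slope = u / (u - l)            # triangle upper bound: slope * (x_j - l)
            new_coeffs[j] = c * slope
            const -= c * slope * l
        else:
            new_coeffs[j] = c * alpha      # triangle lower bound: alpha * x_j
    return new_coeffs, const

def concretize_upper(coeffs, const, lower, upper):
    """Maximize coeffs . x + const over the box [lower, upper]."""
    return float(np.sum(np.where(coeffs >= 0, coeffs * upper, coeffs * lower)) + const)

# Assumed weights of the running example (zero biases).
W0 = np.array([[1.0, 1.0, 0.0], [1.0, -1.0, 0.0], [0.0, 0.0, 1.0]])
W1 = np.array([[1.0, 1.0, 0.0], [-1.0, 1.0, 1.0], [-1.0, 1.0, -1.0]])
W2 = np.array([[1.0, 1.0, 1.0]])

# Known pre-activation bounds from earlier steps, and the assumed input box.
x1_bounds = (np.array([-2.0, -2.0, -1.0]), np.array([2.0, 2.0, 1.0]))
x2_bounds = (np.array([0.0, -2.0, -3.0]), np.array([3.0, 3.0, 2.0]))
x0_bounds = (-np.ones(3), np.ones(3))

coeffs, const = W2[0].copy(), 0.0                        # output as a function of y^2
coeffs, const = relax_upper(coeffs, const, *x2_bounds)   # now linear in x^2
coeffs = coeffs @ W1                                     # now linear in y^1
coeffs, const = relax_upper(coeffs, const, *x1_bounds)   # now linear in x^1
coeffs = coeffs @ W0                                     # now linear in the input x^0
upper_bound = concretize_upper(coeffs, const, *x0_bounds)
print(coeffs, const, upper_bound)
```

The final symbolic expression concretizes to roughly 6.6, matching the hand derivation; concretizing earlier (for example over $y^1$) would be possible at every step, which is the trade-off the paragraph above describes.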


III. ERRORS IN BACK-SUBSTITUTION 


As previously mentioned, although symbolic-bound computation using back-substitution can derive tighter bounds than naive interval propagation, there are cases in which the computed bounds are sub-optimal: for example, while the bounds computed for $x^2_1$ and $x^2_2$ were tight (i.e., there exists an input in $D_i$ for which they are met), the bounds for $x^2_0$ and $x^3_0$ were not. In this section, we analyze the reasons behind such sub-optimal bounds. We begin with the following definitions:

Definition 1 (Optimal bias for bound): Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function and let $U_f(x) = wx + b$ ($w \in \mathbb{R}^n$, $b \in \mathbb{R}$) be a valid linear upper bound for $f$ over the domain $D$, i.e., $\forall x \in D : U_f(x) \ge f(x)$. We say that $b$ is the optimal bias for $U_f(x)$ if $\forall b^* : b^* < b$, it holds that $U^*_f(x) = wx + b^*$ is no longer a valid upper bound for $f$. The definition for the optimal bias for $f$’s lower bound is symmetrical.
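
For a fixed slope, the optimal bias of an upper bound is the largest gap between the function and the bias-free linear part. The sketch below estimates it by dense sampling for a one-dimensional ReLU on $[-1, 1]$ with slope 1/2; the helper name `optimal_upper_bias` and the sampling approach are assumptions made for illustration (sampling is not a sound procedure in general).

```python
import numpy as np

def optimal_upper_bias(f, slope, xs):
    """Smallest b such that slope * x + b >= f(x) on the sampled points xs."""
    return float(np.max(f(xs) - slope * xs))

def relu(x):
    return np.maximum(x, 0.0)

xs = np.linspace(-1.0, 1.0, 2001)          # includes both endpoints
best_bias = optimal_upper_bias(relu, 0.5, xs)

# A bias of 1 is valid but detached; the optimal bias touches the function.
gap_with_bias_one = float(np.min(0.5 * xs + 1.0 - relu(xs)))
gap_with_best_bias = float(np.min(0.5 * xs + best_bias - relu(xs)))
print(best_bias, gap_with_bias_one, gap_with_best_bias)
```

With bias 1 the bound never comes closer than 1/2 to the function, whereas the optimal bias 1/2 leaves a minimal gap of zero.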

An example of optimal and sub-optimal upper bounds appears in Fig. 3. In the graph depicted therein, we plot an upper bound for the function $\mathrm{ReLU}(x)$. The bias value of the first bound (in red) is 1; and as we can see, the resulting bound is not tight. When we set the bias value to 1/2, the bound becomes tight, equaling the function at points $x = -1$ and $x = 1$, and so that is the optimal bias value for that bound.

Fig. 3: A simplified illustration of optimal and sub-optimal bounds for a ReLU function over $x \in [-1, 1]$.


Definition 2 (Bound error): Let $f : \mathbb{R}^n \to \mathbb{R}$, and let $g(x)$ be an upper bound for $f$ over domain $D$, such that we have: $\forall x \in D : g(x) \ge f(x)$. We define the error of $g$ with respect to $f$ as the function: $E(x) = g(x) - f(x)$. The case for a lower bound is symmetrical.

We observe that a linear bound $g$ for $f$ over the domain $D_i$ has optimal bias iff $\exists x \in D_i : E(x) = 0$. We refer to any bound that has a sub-optimal bias, i.e., $\forall x \in D_i : E(x) > 0$, as a detached bound. We show that these detachments occur naturally as part of the back-substitution process, and are partially responsible for the discovery of sub-optimal concrete bounds.

It is straightforward to see that the aforementioned triangle 
relaxation for ReLUs produces linear bounds that are bias- 
optimal for each individual ReLU. However, as it turns out, 
this may not be the case when multiple ReLUs are involved. 
In a typical DNN, a neuron’s value is computed as a weighted 
sum of the ReLUs of values from its preceding layer. Con- 
sequently, when we calculate an upper bound for the neuron 
using back-substitution, we are in fact upper-bounding a sum 
of ReLUs by summing their individual upper bounds. This can 
result in a detached bound, where, despite the fact that each 
ReLU was approximated using a bound with an optimal bias, 
the resulting combined bound does not have optimal bias. 

An illustration of this phenomenon appears in Fig. 4. Sub- 
figures a and b therein show the graph of ReLU functions, 
plotted along their triangle-relaxation upper bound (in orange). 
Sub-figure c then shows the graph of the sum of the two ReLU 
functions from sub-figures a and b, along with the sum of 
their individual upper bounds (again, in orange). As we can 
see, although the upper bounds in a and b touch the functions 
they are approximating in at least one point (and are hence 
bias-optimal), the bound in c is detached, and is hence not 
bias-optimal. Each figure in the lower row of Fig. 4 shows the 
over-approximation error of the figure directly above it. 

Fig. 4: Illustration of the formation of detached bounds as a result of summed errors. Sub-figures a and b correspond to $y^1_0 = \mathrm{ReLU}(x^0_0 + x^0_1)$, $y^1_1 = \mathrm{ReLU}(x^0_0 - x^0_1)$ and their relaxed upper bounds (in orange); and sub-figure c corresponds to $x^2_0 = y^1_0 + y^1_1$ and its symbolic upper bound, computed using back-substitution.

More formally, the error of the upper bound for $\mathrm{ReLU}(x)$ with current bounds $l < 0 < u$ is:

$$E(x) = \frac{u}{u-l}(x - l) - \sigma(x), \qquad x \in [l, u]$$

and we note that $E(l) = E(u) = 0$. In more complex cases, such as the case of the multivariate function $x^2_0 = y^1_0 + y^1_1$ depicted in Fig. 4, the coordinates where the bound error equals zero could be different for $y^1_0$ and $y^1_1$, resulting in the bound obtained for $x^2_0$, their sum, becoming detached from the true value of the function. We now show it for the case of $x^2_0$ in greater detail:

$$x^2_0 = \sigma(x^1_0) + \sigma(x^1_1) = \sigma(x^0_0 + x^0_1) + \sigma(x^0_0 - x^0_1)$$

An upper bound is computed using the relaxations:

$$\sigma(x^0_0 + x^0_1) \le \tfrac{1}{2}(x^0_0 + x^0_1 + 2)$$
$$\sigma(x^0_0 - x^0_1) \le \tfrac{1}{2}(x^0_0 - x^0_1 + 2)$$

where each relaxation has its own relaxation error:

$$E^1_0(x^0_0, x^0_1) = \tfrac{1}{2}(x^0_0 + x^0_1 + 2) - \sigma(x^0_0 + x^0_1)$$
$$E^1_1(x^0_0, x^0_1) = \tfrac{1}{2}(x^0_0 - x^0_1 + 2) - \sigma(x^0_0 - x^0_1)$$

The relaxed linear bound obtained is:

$$x^2_0 \le \tfrac{1}{2}(x^0_0 + x^0_1 + 2) + \tfrac{1}{2}(x^0_0 - x^0_1 + 2) = x^0_0 + 2$$

And its error is the sum of the errors of its summands:

$$E^{total}(x^0_0, x^0_1) = E^1_0 + E^1_1 = x^0_0 + 2 - \sigma(x^0_0 + x^0_1) - \sigma(x^0_0 - x^0_1)$$

We note that:

$$\min(E^1_0) = \min(E^1_1) = 0$$

However:

$$\min(E^{total}) = E^{total}(-1, -1) = 1$$

The reason for this is that at the coordinates $(-1, -1)$ and $(1, 1)$, where $E^1_0(-1, -1) = E^1_0(1, 1) = 0$, we have that $E^1_1(-1, -1) = E^1_1(1, 1) = 1$; and vice versa, for the coordinates $(-1, 1)$ and $(1, -1)$, where $E^1_1(-1, 1) = E^1_1(1, -1) = 0$ and $E^1_0(-1, 1) = E^1_0(1, -1) = 1$. The optimal linear bound for

$$x^2_0 = \sigma(x^0_0 + x^0_1) + \sigma(x^0_0 - x^0_1)$$

is in fact $x^2_0 \le x^0_0 + 1$, which is the bias-optimal version of the existing linear bound of $x^2_0 \le x^0_0 + 2$.
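
The detachment in this example can be checked numerically. The following sketch samples the two individual relaxation errors and their sum over the box $[-1, 1]^2$; the grid resolution and variable names are arbitrary choices, and a grid search is used here only as an illustration (it is not a sound substitute for an exact solver in general).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

grid = np.linspace(-1.0, 1.0, 201)                 # includes the corners of the box
x0, x1 = np.meshgrid(grid, grid, indexing="ij")

# Individual triangle-relaxation errors of the two ReLUs.
err_first = 0.5 * (x0 + x1 + 2.0) - relu(x0 + x1)
err_second = 0.5 * (x0 - x1 + 2.0) - relu(x0 - x1)
err_total = err_first + err_second                 # error of the summed bound x0 + 2

min_first = float(err_first.min())
min_second = float(err_second.min())
min_total = float(err_total.min())

# Lowering the bias by the minimal total error gives the bias-optimal bound.
optimal_bias = 2.0 - min_total
print(min_first, min_second, min_total, optimal_bias)
```

Each individual error reaches zero somewhere in the box, but never at the same point, so the summed bound stays at least 1 above the function and its bias can be lowered from 2 to 1.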


IV. DEEPMIP: MINIMIZING BACK-SUBSTITUTION 
ERRORS 


During a back-substitution execution, the over-approximations of individual ReLUs are repeatedly summed up, which leads to bounds that become increasingly detached with each iteration, and this results in very loose concrete bounds that hamper verification. We now describe our method, which we term DeepMIP, for “tightening” detached bounds, with the goal of eventually obtaining tighter concrete bounds. The idea is to alter the back-substitution mechanism, so that in each iteration it minimizes the sum of errors that result from the relaxation of the current activation layer, effectively pushing loose upper bounds down towards the function by decreasing their bias values (a symmetrical mechanism can be applied for lower bounds). More specifically, we propose to rewrite the general back-substitution rule for a single iteration as follows:

$$\begin{aligned}
\max(x^k_i) &= \max(W^{k-1}_i \sigma(x^{k-1})) \\
&= \max\big(W^{k-1}_i R^{k-2}_U x^{k-1} - (W^{k-1}_i R^{k-2}_U x^{k-1} - W^{k-1}_i \sigma(x^{k-1}))\big) \\
&= \max(W^{k-1}_i R^{k-2}_U x^{k-1} - E^{k-1}) \\
&\le \max(W^{k-1}_i R^{k-2}_U x^{k-1}) - \min(E^{k-1})
\end{aligned}$$


Observe that while $\min(E^{k-1})$ is non-convex, it contains no nested ReLUs, and can often be efficiently solved by MIP solvers [49]. Thus, as DeepMIP performs the iterative back-substitution process, it can invoke a MIP solver to minimize the error in each iteration, and use it to improve the deduced bounds. The pseudo-code for the algorithm appears in the full version of this paper [56]. Observe that DeepMIP can be regarded as a generalization of modern back-substitution methods [45], [54], in the sense that they only use the non-negativity of the error to produce a trivial bound:

$$\min(E^{k-1}) = \min(W^{k-1}_i R^{k-2}_U x^{k-1} - W^{k-1}_i \sigma(x^{k-1})) \ge 0$$

which is correct, since the error of an upper bound is non-negative by definition (in the lower bound case, the error is non-positive, and so 0 can be used as a trivial upper bound).

To continue our computation we denote the error caused by the over-approximation of the activation of layer $t$ during back-substitution as:

$$E^t = W^{k-1}_i \prod_{j=k-2}^{t} (R^j_U W^j)\,(R^{t-1}_U x^t - \sigma(x^t)) \qquad (2)$$

In the definition above, $i$ is the index of the neuron being bounded by the back-substitution. We get:

$$\begin{aligned}
\max(x^k_i) &\le \max(W^{k-1}_i R^{k-2}_U x^{k-1}) - \min(E^{k-1}) \\
&= \max(W^{k-1}_i R^{k-2}_U W^{k-2} \sigma(x^{k-2})) - \min(E^{k-1}) \\
&= \max(W^{k-1}_i R^{k-2}_U W^{k-2} R^{k-3}_U x^{k-2} - E^{k-2}) - \min(E^{k-1}) \\
&\le \max(W^{k-1}_i R^{k-2}_U W^{k-2} R^{k-3}_U x^{k-2}) - \min(E^{k-2}) - \min(E^{k-1}) \\
&\le \ldots \le \max\Big(W^{k-1}_i \prod_{j=k-2}^{0} (R^j_U W^j)\, x^0\Big) - \sum_{j=k-1}^{0} \min(E^j)
\end{aligned}$$


Finally, the maximization problem is transformed into a linear sum over a box domain, which is easy to solve. Since each $E^j$ is shallow (contains no nested ReLUs), it can be minimized efficiently using MIP solvers, and each non-trivial minimum that is found will improve the tightness of the final upper bound. However, we note that the number of MIP problems generated by this process increases linearly with the depth of the neuron within the network, i.e., for a neuron in layer $k$, there are $k$ minimization problems to solve. For deeper networks, especially ones with large domains or ones where many layers only have very loose bounds, minimizing the error terms could become computationally expensive.

Optimization: Direct MIP encoding. As part of its operation, DeepMIP dispatches MIP problems, each corresponding to the over-approximation error of a particular layer. Specifically, when it over-approximates the first layer:

$$\begin{aligned}
&\max\Big(W^{k-1}_i \prod_{j=k-2}^{1} (R^j_U W^j)\, \sigma(W^0 x^0)\Big) - \sum_{j=k-2}^{1} \min(E^j) \\
&\quad \le \max\Big(W^{k-1}_i \prod_{j=k-2}^{0} (R^j_U W^j)\, x^0\Big) - \min(E^0) - \sum_{j=k-2}^{1} \min(E^j)
\end{aligned}$$

it will directly solve the linear optimization problem:

$$\max\Big(W^{k-1}_i \prod_{j=k-2}^{0} (R^j_U W^j)\, x^0\Big)$$

and use a MIP solver to solve:

$$\min(E^0) = \min\Big(W^{k-1}_i \prod_{j=k-2}^{1} (R^j_U W^j)\,\big(R^0_U W^0 x^0 - \sigma(W^0 x^0)\big)\Big)$$

We observe that in this particular case, since we reached the input layer, the initial term can instead be directly solved as a separate MIP query:

$$\max\Big(W^{k-1}_i \prod_{j=k-2}^{1} (R^j_U W^j)\, \sigma(W^0 x^0)\Big)$$

which may result in tighter bounds, since it prevents any additional imprecision. We note that this optimization to DeepMIP generalizes the common practice of directly finding the concrete bounds of the neurons in the first layer using MIP solvers, and only applying back-substitution from the second layer onward [37], [54].

We illustrate this approach by repeating the back- 
substitution process for xg from our running example: 

max(zġ) = max(yp + yi + y2) 


= max(o(#2) + a(x?) + o(x3)) 


= max (ous + yt) + o(—yo + ut + y2) 


+ o(—99 + yt — 4) 


= max(A — EF?) < max(A) — min( Eĉ) 


where 
3 2 12 
A= (yy tui) + tutit. yo tyi—y2)4 ; 
1 12 
=? ce doh 
YW 5 v2 5 


and E? is defined as per Eq. 2: 


3 
Ex = (yo + yi) + =(—yo + yt + y3) 


5 
+5(-ub +h a) -o yo + yi) 
— o(—y9 + Y1 + ¥2) — o(—¥9 + yi — y2) 
= Qyi + iyi + = —o(y9 +91) 
— a(-y + ut + y2) — o(—¥9 + yi — y2) 


Simplifying these expressions, we get that

\[
\max(x_0^3) \le \max(A) - \min(E_U^2) = \max\big(2y_1^1 + \tfrac{1}{5} y_2^1 + \tfrac{12}{5}\big) - \min(E_U^2)
\]

Using a MIP solver to find the minimum of $E_U^2$ over the
variables of $y^1$ reveals that $\min(E_U^2) = \tfrac{2}{5}$. We substitute this,
and get:

\[
\max(x_0^3) \le \max\big(2y_1^1 + \tfrac{1}{5} y_2^1 + \tfrac{12}{5}\big) - \tfrac{2}{5}
\]
Finally, since we have reached the first layer, we write: 
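\[
\begin{aligned}
\max(x_0^3) &\le \max\big(2y_1^1 + \tfrac{1}{5} y_2^1 + \tfrac{12}{5}\big) - \tfrac{2}{5} \\
&= \max\big(2\sigma(x_1^1) + \tfrac{1}{5}\sigma(x_2^1) + \tfrac{12}{5}\big) - \tfrac{2}{5} \\
&= \max\big(2\sigma(x_0^0 - x_1^0) + \tfrac{1}{5}\sigma(x_1^0) + \tfrac{12}{5}\big) - \tfrac{2}{5}
\end{aligned}
\]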


and then, using our proposed enhancement, we directly solve 
this maximization over the input layer instead of back- 
substituting it any further. The MIP solver replies that: 


\[
\max\big(2\sigma(x_0^0 - x_1^0) + \tfrac{1}{5}\sigma(x_1^0) + \tfrac{12}{5}\big) = 6\tfrac{2}{5}
\]
and we then substitute this value to obtain: 
\[
\max(x_0^3) \le 6\tfrac{2}{5} - \tfrac{2}{5} = 6
\]


As we can see, minimizing the errors by using MIP (which 
is very fast in practice) allows us to back-substitute bounds 
with optimal bias, which yields tighter bounds for the output 
variable. 
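
The value $\min(E_U^2) = 2/5$ in the example above can be sanity-checked without a MIP solver. The following is only a sketch, not part of DeepMIP: it uses the expression for $E_U^2$ as we read it from the running example, and it assumes (hypothetically, since the bounds are not restated in this section) that $y_0^1, y_1^1 \in [0, 2]$ and $y_2^1 \in [0, 1]$, which are ranges consistent with the relaxation slopes 3/5 and 2/5. Because $E_U^2$ is a linear function minus a sum of convex ReLUs, it is concave, so its minimum over a box is attained at a corner and enumerating the eight corners suffices.

```python
from fractions import Fraction as F
from itertools import product


def relu(v):
    return max(v, F(0))


def error_term(y0, y1, y2):
    """E_U^2 of the running example: linear upper bound minus the exact ReLUs."""
    linear = 2 * y1 + F(1, 5) * y2 + F(12, 5)
    exact = relu(y0 + y1) + relu(-y0 + y1 + y2) + relu(-y0 + y1 - y2)
    return linear - exact


# Assumed (hypothetical) bounds on the first-layer outputs y_0^1, y_1^1, y_2^1.
BOX = [(F(0), F(2)), (F(0), F(2)), (F(0), F(1))]

# E_U^2 is concave, so its minimum over the box lies at one of the corners.
corner_errors = [error_term(*corner) for corner in product(*BOX)]
min_error = min(corner_errors)
print(min_error)  # 2/5
```

Exact rational arithmetic is used so that the result can be compared with the fraction in the text directly; a real implementation would hand the same expression to a MIP solver instead of enumerating corners.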


MiniMIP. While DeepMIP produces very strong bounds, for 
each neuron it must solve multiple MIP instances during back- 
substitution — many of them for bounds that may already 
be bias-optimal. This large number of instances to solve can 
result in a large overhead, and makes it worthwhile to explore 
heuristics for only solving some of these instances. 

To illustrate this, we propose a particular, aggressive heuris- 
tic that we call MiniMIP. Instead of minimizing all error terms 
during back-substitution, MiniMIP only solves the final query 
in this series — that is, the query in which the bounds of the 
current layer are expressed as sums of ReLUs of input neurons. 
This approach significantly reduces overhead: exactly one MIP 
instance is solved in each iteration, regardless of the depth of 
the layer currently being processed. As we later see in our 
evaluation, even this is already enough to achieve state-of- 
the-art performance and very tight bounds; and the resulting 
queries can be solved very efficiently [49]. 


V. EVALUATION 


Implementation. For evaluation purposes, we created a proof- 
of-concept implementation of our approach in Python. The 
implementation code, alongside all the benchmarks described 
in this section, is publicly available online [55]. Our implemen- 
tation uses the PyTorch library [40] for computing the optimal 
value of α for each ReLU’s triangle relaxation, as is done in 
other tools [54]. We use Gurobi [26] as the MIP solver for the 
minimization of errors and direct concretization of bounds. 
We ran all experiments on a compute cluster consisting of 
Xeon E5-2637 CPUs, and a 2-hour timeout per experiment. 
We note that our implementation currently runs on CPUs only, 
and extending it to support GPUs is left for future work. 


Abstraction refinement cascade. For each verification query, 
prior to applying our iterative error minimization scheme, 
we configured our implementation to first run a light-weight, 
“ordinary” symbolic-bound propagation pass. Specifically, we 
ran a single pass of the DeepPoly mechanism [45]. A similar 
technique is applied by other tools [37]. 


Benchmarks. We evaluated our approach on fully-connected, 
ReLU networks trained over the MNIST dataset, taken from 
the ERAN repository [19]. The topologies of the networks we 
used appear in Table I. 
TABLE I: The DNNs used in our evaluation.

Dataset | Model   | Type | Neurons | Hidden Layers | Activation
MNIST   | 6 x 100 | FC   | 510     | 5             | ReLU
MNIST   | 9 x 100 | FC   | 810     | 8             | ReLU
MNIST   | 6 x 200 | FC   | 1010    | 5             | ReLU
MNIST   | 9 x 200 | FC   | 1610    | 8             | ReLU


For verification queries, we followed standard practice [31], 
[37], [54], and attempted to prove the adversarial robustness 
of the first 1000 images of the MNIST test set: that is, we used 
verification to try and prove that ε-perturbations to correctly 
classified inputs in the dataset cannot change the classification 
assigned by the DNN. 


We compared the DeepMIP approach (specifically, Min- 
iMIP) to two state-of-the-art verification approaches [9]: 
the PRIMA solver [37], and our implementation of the α-CROWN
method [54], which represents the state of the art 
in symbolic-bound tightening with back-substitution. Indeed, 
many other verification tools integrate back-substitution with 
additional techniques, such as search-based techniques [32] or 
abstraction-refinement [7], making it more difficult to measure 
the effectiveness of the back-substitution component alone. 
However, since the α-CROWN implementation in our evaluation
also served as the baseline back-substitution method to 
which we added our methods, any difference between the two 
is solely due to the addition of our suggested technique. The 
results of our experiments are summarized in Table II. Recall 
that symbolic-bound propagation techniques are incomplete, 
and may fail to prove a given query; the Solved columns indi- 
cate the number of instances (out of 1000) that each method 
was able to prove to be robust to adversarial perturbations. The 
Time columns indicate the run time of each method (including 
timeouts), averaged over the 1000 benchmarks solved. 


Our results clearly indicate the superiority of the bounds
discovered by DeepMIP: in all categories, our approach
was able to solve the largest number of instances,
solving a total of 2378 instances, compared to 2183 instances
solved by PRIMA (195 extra instances solved) and 1087
instances solved by α-CROWN (1291 extra instances solved).
These improvements come with an overhead, due to the
additional MIP queries that need to be solved: our approach
is approximately 5.6 times slower than α-CROWN, and 2.5
times slower than PRIMA. Furthermore, DeepMIP timed out
on 2 out of the 3829 total benchmarks tested (~0.05%), while
PRIMA and α-CROWN did not have any timeouts. 
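
The aggregate figures quoted above follow directly from the per-model rows of Table II. As a small sketch (the dictionary layout and variable names are ours, chosen for illustration), they can be recomputed as follows:

```python
# Per-model (solved, time in seconds) pairs, copied from Table II in the order
# 6x100, 9x100, 6x200, 9x200.
RESULTS = {
    "alpha-CROWN": [(207, 38), (223, 88), (349, 93), (308, 257)],
    "PRIMA": [(504, 123), (427, 252), (652, 222), (600, 462)],
    "DeepMIP (MiniMIP)": [(581, 302), (463, 452), (709, 801), (625, 1121)],
}

# Column totals, matching the "Total" row of the table.
totals = {
    name: (sum(s for s, _ in rows), sum(t for _, t in rows))
    for name, rows in RESULTS.items()
}
ours_solved, ours_time = totals["DeepMIP (MiniMIP)"]

for name in ("PRIMA", "alpha-CROWN"):
    solved, time = totals[name]
    print(f"{name}: {ours_solved - solved} extra instances, "
          f"{ours_time / time:.1f}x slower")
```

The slowdown factors are ratios of the summed time columns, which is how the 5.6 and 2.5 figures in the text are obtained.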


The main conclusions that we draw from these experiments 
are that (i) the DeepMIP approach has a significant potential 
for solving queries that other approaches cannot; and (ii) ad- 
ditional work, in the form of improved heuristics, engineering 
improvements, and support for GPUs is still required to make 
our approach faster. Our results also indicate that a portfolio- 
based approach, which starts from light-weight techniques and 
then progresses towards DeepMIP for difficult queries, could 
enjoy the benefits of both worlds. 
VI. RELATED WORK 


The topic of DNN verification has been receiving significant 
attention from the formal methods community, and various 
tools and methods have been proposed for addressing it. 
These include techniques that leverage SMT solvers (e.g., [27], 
[32], [39], [53]), LP and MILP solvers (e.g., [13], [15], 
[36], [49]), reachability analysis [47], abstraction-refinement 
techniques [7], [16], [17], and many others. The techniques 
most related to DeepMIP are those that rely on the propagation 
of symbolic bounds using abstract interpretation (e.g., [21], 
[50]-[52]). Recent work has also extended beyond answering 
binary questions about DNNs, instead targeting tasks such as 
automated DNN repair [23], [34], DNN simplification [22], 
[35], ensemble selection [3], and quantitative verification and 
optimization [10], [46]; and also the verification of recurrent 
neural networks [28], [41], [57] and reinforcement-learning 
based systems [4], [18], [29]. Our proposed techniques could 
be integrated into any number of these approaches. 

Bound propagation has been playing a significant part in 
DNN verification efforts for the past few years. Starting 
with interval-arithmetic-based propagation [31] and optimiza- 
tion queries for individual neurons [15], [49], these ap- 
proaches have progressed to use various relaxations and over- 
approximations for individual neurons [21], [45], [51] and sets 
thereof [37], [38], [44], culminating in highly sophisticated 
approaches [37], [54]. We consider our work as another step 
in this very promising research direction. 


VII. CONCLUSION AND FUTURE WORK 


We presented an enhancement to the popular back- 
substitution procedure, which includes a formulation of the 
over-approximation errors introduced during back-substitution. 
These errors can then be minimized, in order to greatly tighten 
the resulting bounds. Our approach achieves tighter bounds 
than state-of-the-art approaches, but at the cost of longer 
running times; and we are currently exploring methods for 
expediting it. Specifically, moving forward, we intend to focus 
on adding support for GPUs; on better refinement heuristics; 
on better MIP encoding [6]; and also on improving the core 
algorithm to utilize previously calculated bounds and errors. 
Furthermore, we intend to generalize our methods to other 
abstract domains, and also to integrate them with search-based 
techniques. 


ACKNOWLEDGEMENTS 


The project was partially supported by the Israel Science 
Foundation (grant number 683/18) and by the Binational 
Science Foundation (grant number 2020250). 


APPENDIX A 
RELAXATION MATRICES 


The matrices $R_U^t$ and $R_L^t$ are how we apply the triangle
relaxation during back-substitution over layer $t$. For example,
if:

gtt! = 


j= (x) — 20(24) 


TABLE II: Comparing DeepMIP to α-CROWN and PRIMA.

Model   | ε     | α-CROWN Solved | α-CROWN Time (seconds) | PRIMA Solved | PRIMA Time (seconds) | DeepMIP (MiniMIP) Solved | DeepMIP (MiniMIP) Time (seconds)
6 x 100 | 0.026 | 207            | 38                     | 504          | 123                  | 581                      | 302
9 x 100 | 0.026 | 223            | 88                     | 427          | 252                  | 463                      | 452
6 x 200 | 0.015 | 349            | 93                     | 652          | 222                  | 709                      | 801
9 x 200 | 0.015 | 308            | 257                    | 600          | 462                  | 625                      | 1121
Total   |       | 1087           | 476                    | 2183         | 1059                 | 2378                     | 2676
Abstract—Deep neural networks (DNNs) have become the 
technology of choice for realizing a variety of complex tasks. 
However, as highlighted by many recent studies, even an im- 
perceptible perturbation to a correctly classified input can lead 
to misclassification by a DNN. This renders DNNs vulnerable 
to strategic input manipulations by attackers, and also over- 
sensitive to environmental noise. To mitigate this phenomenon, 
practitioners apply joint classification by an ensemble of DNNs. 
By aggregating the classification outputs of different individual 
DNNs for the same input, ensemble-based classification reduces 
the risk of misclassifications due to the specific realization of 
the stochastic training process of any single DNN. However, 
the effectiveness of a DNN ensemble is highly dependent on its 
members not simultaneously erring on many different inputs. In 
this case study, we harness recent advances in DNN verification 
to devise a methodology for identifying ensemble compositions 
that are less prone to simultaneous errors, even when the input 
is adversarially perturbed — resulting in more robustly-accurate 
ensemble-based classification. Our proposed framework uses a 
DNN verifier as a backend, and includes heuristics that help 
reduce the high complexity of directly verifying ensembles. More 
broadly, our work puts forth a novel universal objective for 
formal verification that can potentially improve the robustness 
of real-world, deep-learning-based systems across a variety of 
application domains. 


I. INTRODUCTION 


In recent years, deep learning [33] has emerged as the 
state-of-the-art solution for a myriad of tasks. Through the 
automated training of deep neural networks (DNNs), engineers 
can create systems capable of correctly handling previously 
unencountered inputs. DNNs excel at tasks ranging from 
image recognition and natural language processing to game 
playing and protein folding [2], [21], [38], [48], [74], [75], 
and are expected to play a key role in various complex 
systems [15], [44]. 

Despite their immense success, DNNs suffer from severe 
vulnerabilities and weaknesses. A prominent example is the 
sensitivity of DNNs to adversarial inputs [34], [49], [80], i.e., 
slight perturbations of correctly-classified inputs that result 
in misclassifications. The susceptibility of DNNs to input 
perturbations involves two risks that limit the applicability 
of deep learning to mission-critical tasks: (1) falling victim 
to strategic input manipulations by attackers, and (2) failing 
to generalize well in the presence of environmental noise. In 
light of the above, recent work has focused on enhancing the 
robustness of DNN-based classification to adversarial inputs 
while preserving accuracy [13], [29], [62], [82], [97]. Infor- 
mally, a classifier is robustly accurate (aka astute [86]) with 
respect to a given distribution over inputs, if it continues to 
correctly classify inputs drawn from this distribution, with high 
probability, even when these inputs are arbitrarily perturbed 
(up to some maximally allowed perturbation). 

We focus here on a classic technique for improving clas- 
sification quality [9], [52]: combining the outputs of an 
ensemble [28], [37], [81] of DNN-based classifiers on an 
input to derive a joint classification decision for that input. 
By incorporating the outputs of independently-trained DNNs, 
ensembles mitigate the risk of misclassification of a single 
DNN due to a specific realization of its stochastic training 
process and the specifics of its training data traversal. For a 
DNN ensemble to provide a meaningful improvement over 
utilizing a single DNN, its members should not frequently 
misclassify the same input. Consider, for instance, an extreme 
example, where an ensemble with k = 10 members is 
used, but for some part of the input space, the 10 DNNs 
effectively behave identically, making mistakes on the exact 
same inputs. In this scenario, the ensemble as a whole is no 
more robust on this input subspace than each of its individual 
members. Our objective is to demonstrate how recent advances 
in DNN verification [40], [45] can be harnessed to provide 
system designers and engineers with the means to avoid such 
scenarios, by constructing adequately diverse ensembles. 

Significant progress has recently been made on formal 
verification techniques for DNNs [1], [8], [11], [12], [26], 
[56], [67], [76], [90]. The basic DNN verification query is to 
determine, given a DNN N, a precondition P, and a postcon- 
dition Q, whether there exists an input x such that P(x) and 
Q(N(x)) both hold. Recent verification work has focused on 
identifying adversarial inputs to DNN-based classification, or 
formally proving that no such inputs exist [30], [35], [58]. We 
demonstrate the applicability of DNN verification to solving 
a new kind of queries, pertaining to DNN ensembles, which 
could significantly boost the robustness of these ensembles 
(as opposed to just measuring the robustness of individual 
DNNs). We note that despite great strides in recent years [47], 
[58], [76], even state-of-the-art DNN verification tools face 
severe scalability limitations. This renders solving verification 
queries pertaining to ensembles extremely challenging, since 
the complexity of this task grows exponentially with the 
number of ensemble members (see Section III). 

In this case-study paper, we propose and evaluate an efficient
and scalable approach for verifying that different ensemble
members do not tend to err simultaneously. Specifically,
our scheme considers small subsets of ensemble members,¹
and dispatches verification queries to seek perturbations of
inputs for which all members in the subset err simultaneously.
By identifying such inputs, we can assign a mutual error
score to each subset. Using these mutual error scores, we
compute, for each individual ensemble member, a uniqueness
score that signifies how often it errs simultaneously with other
ensemble members. This score can be used to detect the
“weakest” ensemble members, i.e., those most prone to erring
in parallel to others, and replace them with fresh DNNs,
thus enhancing the diversity among the ensemble members,
and improving the overall robust accuracy of the ensemble.

¹While our technique is applicable to subsets of any size, we focused on
pairs in our evaluation, as we later elaborate.

To evaluate our scheme, we implemented it as a proof- 
of-concept tool, and used this tool to conduct extensive ex- 
perimentation on DNN ensembles for classifying digits and 
clothing items. Our results demonstrate that by identifying the 
weakest ensemble members (using verification) and replac- 
ing them, the robust accuracy of the ensemble as a whole 
may be significantly improved. Additional experiments that 
we conducted also demonstrate that our verification-driven 
approach affords significant advantages when compared to 
competing, non-verification-based, methods. Together, these 
results showcase the potential of our approach. Our code and 
benchmarks are publicly available online [6]. 

The rest of the paper is organized as follows. Section II contains
background on DNN ensembles and DNN verification. 
In Section III we present our verification-based methodology 
for ensemble selection, and then present our case study in 
Section IV. Next, in Section V we compare our verification- 
based approach to state-of-the-art, gradient-based, methods. 
Related work is covered in Section VI, and we conclude and 
discuss future work in Section VII. 


II. BACKGROUND 


Deep Neural Networks. A deep neural network (DNN) [33] 
is a directed graph, comprised of layers of nodes (also known 
as neurons). In feed-forward DNNs, data flows sequentially 
from the first (input) layer, through a sequence of intermediate 
(hidden) layers, and finally into an output layer. The network’s 
output is evaluated by assigning values to the input layer’s 
neurons and computing the value assignment for neurons in 
each of the following layers, in order, until reaching the 
output layer and returning its neuron values to the user. In 
classification networks, which are our subject matter here, each 
output neuron corresponds to an output class; and the output 
neuron with the highest value represents the class, or label, 
which the particular input is being classified as. 

Fig. 1 depicts a toy DNN. It has an input layer with two
neurons, followed by a weighted sum layer, which computes
an affine transformation of values from its preceding layer. For
example, for input $V_1 = [1, -5]^T$, the second layer’s computed
values are $V_2 = [-8, 1]^T$. Next is a ReLU layer, which applies
the ReLU function $\mathrm{ReLU}(x) = \max(0, x)$ to each individual
neuron, resulting in $V_3 = [0, 1]^T$. Finally, the network’s output
layer again computes an affine transformation, resulting in
the output $V_4 = [6, 3]^T$. Thus, input $[1, -5]^T$ is classified as
the label corresponding to the first output neuron, whose value
is the highest. For additional details, see [33].

Fig. 1: A toy DNN.
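
The layer-by-layer evaluation described above can be written out in a few lines. This is a minimal sketch: the weights and biases below are hypothetical, chosen only so that the intermediate values match those quoted in the text, since the actual weights of the toy DNN appear only in the figure.

```python
def affine(weights, bias, values):
    """Weighted-sum layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def relu_layer(values):
    return [max(0, v) for v in values]


# Hypothetical parameters (not taken from Fig. 1), picked to reproduce V2 and V4.
W1, B1 = [[2, 2], [1, 0]], [0, 0]
W2, B2 = [[1, 6], [1, 3]], [0, 0]

v1 = [1, -5]
v2 = affine(W1, B1, v1)   # weighted-sum layer
v3 = relu_layer(v2)       # ReLU layer
v4 = affine(W2, B2, v3)   # output layer

# The predicted class is the index of the output neuron with the highest value.
label = max(range(len(v4)), key=v4.__getitem__)
print(v2, v3, v4, label)
```

With these parameters the script reproduces the values $[-8, 1]$, $[0, 1]$ and $[6, 3]$ from the example, and selects the first output neuron.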


Accuracy, Robustness, and Deep Ensembles. The weights
of a DNN are determined through its training process. In
supervised learning, we are provided a set of pairs $(x_i, l_i)$
drawn according to some (unknown) distribution $D$, where $x_i$
is an input point and $l_i$ is a ground-truth label for that input.
The goal is to select weights for the DNN $N$ that maximize
its accuracy, which is defined as $\Pr_{(x,l) \sim D}(N(x) = l)$ (we
slightly abuse notation, and use $N(x)$ to denote both the
network’s output vector, as well as the label it assigns $x$).

We restrict our attention to the classification setting, in 
which labels are discrete. The training of a DNN-based classi- 
fier is typically a stochastic process. This process is affected, 
for example, by the initial assignment of weights to the DNN, 
the order in which training data is traversed, and more. A 
prominent method for avoiding misclassifications originating 
from the stochastic training of a single DNN is employing 
deep ensembles. A deep ensemble is a set $\mathcal{E} = \{N_1, \ldots, N_k\}$
of k independently-trained DNNs. The ensemble classifies an 
input by aggregating the individual classification outputs of 
its members (see Fig. 2). The collective decision is typically 
achieved by averaging over all members’ outputs. Ensembles 
have been shown to often achieve better accuracy than their 
individual members [9], [52], [57], [92]. 

A critical condition for the success of ensemble-based 
classifiers is that the ensemble members’ misclassifications 
are not strongly correlated [53], [63], [79]. This key property 
is crucial in order to avoid a scenario where many different 
members of the ensemble frequently make mistakes on the 
same input, causing the ensemble as a whole to also err on 
that input. Heuristics for achieving diversity across ensemble 
members include, e.g., training the members simultaneously 
with diversity-aware loss [43], [52], randomly initializing 
different weights for the ensemble members [50], and other 
methods [63], [73]. 

Since the discovery of adversarial inputs, practitioners have
become interested in DNNs that are not only accurate but
also robustly accurate. We say that a network $N$ is ε-robust
around the point $x$ if every input point $x'$ that is at most ε away
from $x$ receives the same classification as $x$:
$\|x' - x\| \le \epsilon \Rightarrow N(x) = N(x')$, where $N(x)$ is the label assigned to
$x$; and the definition of accuracy is generalized to ε-robust
accuracy as follows: $\Pr_{(x,l) \sim D}(\|x' - x\| \le \epsilon \Rightarrow N(x') = l)$.
While improvements in accuracy afforded by ensembles are
straightforward to measure, this is typically not the case for
robust accuracy, as we discuss in Section III.

Fig. 2: An ensemble comprising three DNNs. Each input
vector is independently classified by all three networks, and
the results are aggregated into a final classification.


DNN Verification. Given a DNN $N$, a verification query on
$N$ specifies a precondition $P$ on $N$’s input vector $x$, and a
postcondition $Q$ on $N$’s output vector $N(x)$. A DNN verifier
needs to determine whether there exists a concrete input $x_0$
that satisfies $P(x_0) \wedge Q(N(x_0))$ (the SAT case), or not (the
UNSAT case). Typically, $P$ and $Q$ are expressed in the logic
of linear real arithmetic. For instance, the ε-robustness of a
DNN around a point $x$ can be phrased as a DNN verification
query, and then dispatched using existing technology [30],
[45], [85]. The DNN verification problem is known to be
NP-complete [46].


III. IMPROVING ROBUST ACCURACY USING VERIFICATION 
A. Directly Quantifying Robust Accuracy is Hard 


In order to construct a robustly-accurate ensemble $\mathcal{E}$ with
$k$ members, we train a set of $n > k$ DNNs and then seek to
select a subset of $k$ DNNs that provides high robust accuracy.
This method of training multiple models and then discarding a
subset thereof is known as ensemble pruning, and is a common
practice in deep-ensemble training [14], [98]. In our case, a
straightforward approach to do so would be to quantify the
robust accuracy for all possible $k$-sized DNN-subsets, and then
pick the best one. This, however, is computationally expensive,
and requires an accurate estimate of the robust accuracy of an
ensemble.

A natural approach for estimating the ε-robust accuracy of
a DNN is to verify, for many points in the test data, that the
DNN yields an accurate label not only on each data point
itself, but also on each and every input derived from that data
point via an ε-perturbation [30]. The fraction of tested points
for which this is indeed the case can be used to estimate the
accuracy of the classifier on the underlying distribution from
which the data is generated.

A similar process can be performed for an ensemble
$\mathcal{E} = \{N_1, \ldots, N_k\}$, by first constructing a single, large DNN $N_\mathcal{E}$
that aggregates $\mathcal{E}$’s joint classification, and then verifying its
robustness on a set of points from the test data (see the
extended version of this paper [7]). However, this approach
faces a significant scalability barrier: the DNN ensemble,
$N_\mathcal{E}$, comprised of all $k$ member-DNNs is (roughly) $k$ times
larger than any of the $N_i$’s, and since DNN verification
becomes exponentially harder as the DNN size increases,
$N_\mathcal{E}$’s size might render efficient verification infeasible. As we
demonstrate later, this is the case even when the constituent
networks themselves are fairly small. Our proposed methodology
circumvents this difficulty by only solving verification
queries pertaining to very small sets of DNNs.


B. Mutual Error Scores and Uniqueness Scores 


In general, the less likely it is that members of an ensemble 
err simultaneously with other members, the more accurate the 
ensemble is. This motivates our definition of mutual error 
scores below. 


Definition 1 (Agreement Points): Given an ensemble
$\mathcal{E} = \{N_1, N_2, \ldots, N_k\}$, we say that an input point $x_0$ is an
agreement point for $\mathcal{E}$ if there is some label $y_0$ such that
$N_i(x_0) = y_0$ for all $i \in [k]$. We let $\mathcal{E}(x_0)$ denote the label $y_0$.

As we later discuss, the ε-neighborhoods of agreement
points are natural locations for detecting hidden tendencies
of ensemble members to err together.


Definition 2 (Mutual Errors): Let $\mathcal{E}$ be an ensemble, and
let $x_0$ be an agreement point for $\mathcal{E}$. Let $B_{x_0,\epsilon}$ be the ε-ball
around $x_0$, $B_{x_0,\epsilon} = \{x \mid \|x - x_0\|_\infty \le \epsilon\}$. We say that $N_1$ and
$N_2$ have a mutual error in $B_{x_0,\epsilon}$ if there exists a point $x \in B_{x_0,\epsilon}$
such that $N_1(x) \ne \mathcal{E}(x_0)$ and $N_2(x) \ne \mathcal{E}(x_0)$.


Intuitively, if $N_1$ and $N_2$ have many mutual errors, incorporating
both into an ensemble is a poor choice. This naturally
gives rise to the following definition:


Definition 3 (Mutual Error Scores): Let $A$ be a finite set
of $m$ agreement points in an ensemble $\mathcal{E}$’s input space, and let
$B_1, B_2, \ldots, B_m$ denote the ε-balls surrounding the points in
$A$. Let $N_1, N_2$ denote two members of $\mathcal{E}$. The mutual error
score of $N_1$ and $N_2$ with respect to $\mathcal{E}$ and $A$ is denoted by
$ME_{\mathcal{E},A}(N_1, N_2)$, and defined as:

\[
ME_{\mathcal{E},A}(N_1, N_2) = \frac{|\{i \mid N_1 \text{ and } N_2 \text{ have a mutual error in } B_i\}|}{m}
\]

Observe that $ME_{\mathcal{E},A}(N_1, N_2)$ is always in the range [0, 1].
The closer it is to 1, the more mutual errors $N_1$ and $N_2$ have,
making it unwise to place them in the same ensemble.


Definition 4 (Uniqueness Scores): Given an ensemble
$\mathcal{E} = \{N_1, N_2, \ldots, N_n\}$ and a set $A$ of agreement points for $\mathcal{E}$, we
define, for each ensemble member $N_i$, the uniqueness score
for $N_i$ with respect to $\mathcal{E}$ and $A$, $US_{\mathcal{E},A}(N_i)$, as:

\[
US_{\mathcal{E},A}(N_i) = 1 - \frac{\sum_{j \ne i} ME_{\mathcal{E},A}(N_i, N_j)}{n - 1}
\]

The uniqueness score (US) of $N_i$ is the complement of its
average mutual error score with the other ensemble members.
When this score is close to 0, $N_i$ tends to err simultaneously
with other members of the ensemble on points in $A$. In
contrast, the closer the uniqueness score is to 1, the rarer it
is for $N_i$ to misclassify the same inputs as other members of
the ensemble. Hence, ensemble members with low uniqueness
scores are, intuitively, good candidates for replacement.
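
To make Definitions 3 and 4 concrete, the following sketch computes mutual error scores and uniqueness scores from a table of verifier outcomes. The member names, the outcome table, and the function names are hypothetical; in practice each Boolean entry would come from one verification query over one ε-ball.

```python
def mutual_error_score(errors, a, b):
    """Fraction of agreement points whose ball contains a mutual error of a and b."""
    hits = errors[frozenset((a, b))]
    return sum(hits) / len(hits)


def uniqueness_score(errors, members, i):
    """One minus the average mutual error score of member i with all others."""
    others = [j for j in members if j != i]
    total = sum(mutual_error_score(errors, i, j) for j in others)
    return 1 - total / len(others)


# Hypothetical outcomes for 3 members and m = 4 agreement points:
# errors[{a, b}][p] is True if a mutual error of a and b was found in ball p.
errors = {
    frozenset(("N1", "N2")): [True, True, False, False],
    frozenset(("N1", "N3")): [False, False, False, False],
    frozenset(("N2", "N3")): [True, False, False, False],
}
members = ["N1", "N2", "N3"]
scores = {n: uniqueness_score(errors, members, n) for n in members}
print(scores)
```

In this toy instance the second member has the lowest uniqueness score, so it would be the natural candidate for replacement.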

We point out that our definitions above can naturally be 
generalized to larger subsets of the ensemble members — thus 
measuring robust accuracy more precisely, but rendering these 
measurements more complex to perform in practice. 


Computing Mutual Errors. The only computationally complex
step in determining the uniqueness scores of individual
ensemble members is computing the pairwise mutual errors
for the ensemble. To this end, we leverage DNN verification
technology. Specifically, given two ensemble members $N_1$
and $N_2$, an agreement point $a$ for the ensemble with label
$l$, and ε > 0, an appropriate DNN verification query can
be formulated as follows. First, we construct from $N_1$ and
$N_2$ a single, larger DNN $N$, which captures $N_1$ and $N_2$
simultaneously processing a shared input vector, side-by-side.
This network $N$ is then passed to a DNN verifier, with
the precondition that the input be restricted to $B$, an ε-ball
around $a$, and the postcondition that (1) among $N$’s output
neurons that correspond to the outputs of $N_1$, the neuron
representing $l$ not be maximal, and (2) among $N$’s output
neurons that correspond to the outputs of $N_2$, the neuron
representing $l$ not be maximal. Such queries are supported
by most available DNN verification engines. We note that this
encoding (depicted in Figure 3), where two networks and their
output constraints are combined into a single query, is crucial
for finding inputs on which both DNNs err simultaneously. For
additional details, see the extended version of this paper [7].
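
The side-by-side construction can be sketched for fully-connected ReLU networks as follows. This is an illustration under our own assumptions rather than the encoding used by the tool: both networks are assumed to have the same depth, each network is represented as a list of (weights, bias) pairs, and the two example networks are hypothetical. The sketch only evaluates the postcondition on one concrete input; deciding whether any such input exists in the ε-ball is the job of the DNN verifier.

```python
def affine(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


def forward(layers, x):
    """Evaluate a ReLU network given as a list of (W, b) pairs."""
    for idx, (W, b) in enumerate(layers):
        x = affine(W, b, x)
        if idx < len(layers) - 1:           # no ReLU after the output layer
            x = [max(0.0, v) for v in x]
    return x


def side_by_side(net_a, net_b):
    """Combine two equal-depth networks over a shared input into one network."""
    (Wa, ba), (Wb, bb) = net_a[0], net_b[0]
    combined = [(Wa + Wb, ba + bb)]          # first layer: both read the same input
    for (Wa, ba), (Wb, bb) in zip(net_a[1:], net_b[1:]):
        cols_a, cols_b = len(Wa[0]), len(Wb[0])
        W = ([row + [0.0] * cols_b for row in Wa]
             + [[0.0] * cols_a + row for row in Wb])
        combined.append((W, ba + bb))        # later layers: block-diagonal weights
    return combined


def both_err(outputs, n_classes, label):
    """Postcondition: in each half of the output, `label` is not maximal."""
    halves = (outputs[:n_classes], outputs[n_classes:])
    return all(max(h) > h[label] for h in halves)


# Two hypothetical 2-input, 2-class networks.
identity = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
net_1 = [identity, identity]
net_2 = [([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]), identity]
joint = side_by_side(net_1, net_2)

point = [1.0, 2.0]
joint_out = forward(joint, point)
print(joint_out, both_err(joint_out, n_classes=2, label=0))
```

The joint network's output is the concatenation of the two members' outputs, which is what allows a single postcondition to constrain both members at once.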


C. Ensemble Selection using Uniqueness Scores 


An Iterative Scheme. Building on our verification-based 
method for computing mutual error scores, we propose an 
iterative scheme for constructing an ensemble. Our scheme 
consists of the following steps: 


1) independently train a set $\mathcal{N}$ of $n$ DNNs, and identify a
   set $A$ of $m$ agreement points that are correctly classified
   by all $n$ DNNs.² This is done by sequentially checking
   points from the validation dataset;
2) arbitrarily choose an initial candidate ensemble $\mathcal{E}$ of size
   $k < n$;
3) compute (using a verification engine backend) all mutual
   error scores for the DNN members comprising $\mathcal{E}$, with
   respect to $A$;
4) compute the uniqueness score for each ensemble member,
   and identify a DNN member $N_l$ with a low score;
5) identify a fresh DNN $N_f$, not currently in $\mathcal{E}$, that has a
   higher uniqueness score than $N_l$, if one exists, and replace
   $N_l$ with $N_f$. Specifically, identify a DNN $N_f \in \mathcal{N} \setminus \mathcal{E}$,
   such that the uniqueness score of $N_f$ with respect
   to the ensemble $\mathcal{E} \setminus \{N_l\} \cup \{N_f\}$ and the point set $A$,
   namely $US_{\mathcal{E} \setminus \{N_l\} \cup \{N_f\}, A}(N_f)$, is maximal. If this score
   is greater than $US_{\mathcal{E},A}(N_l)$, replace $N_l$ with $N_f$, i.e., set
   $\mathcal{E} := \mathcal{E} \setminus \{N_l\} \cup \{N_f\}$; and
6) repeat Steps (3) through (5), until no $N_f$ is found or until
   the user-provided timeout or maximal iteration count is
   exceeded.

²In our experiments, we arbitrarily chose k = 5, n = 10 and m = 200.
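
The steps above can be sketched as a short greedy loop. This is a simplified illustration, not the tool's implementation: it assumes the pairwise mutual error scores are already available in a hypothetical table `me` (in the actual scheme they are obtained from verification queries), it takes the member with the lowest uniqueness score as the "weakest" one, and the pool and scores below are toy values.

```python
def uniqueness(me, ensemble, member):
    """Uniqueness score of `member` within `ensemble`, from pairwise ME scores."""
    others = [m for m in ensemble if m != member]
    return 1 - sum(me[frozenset((member, o))] for o in others) / len(others)


def select_ensemble(me, pool, initial, max_iters=100):
    ensemble = list(initial)
    for _ in range(max_iters):
        # Step 4: the weakest member has the lowest uniqueness score.
        weakest = min(ensemble, key=lambda m: uniqueness(me, ensemble, m))
        weakest_score = uniqueness(me, ensemble, weakest)
        # Step 5: best fresh candidate w.r.t. the ensemble with `weakest` swapped out.
        rest = [m for m in ensemble if m != weakest]
        fresh = [c for c in pool if c not in ensemble]
        if not fresh:
            break
        best = max(fresh, key=lambda c: uniqueness(me, rest + [c], c))
        if uniqueness(me, rest + [best], best) <= weakest_score:
            break  # converged: no replacement beats the weakest member
        ensemble = rest + [best]
    return ensemble


# Hypothetical pairwise mutual error scores for a pool of n = 4 DNNs.
me = {
    frozenset("AB"): 0.5, frozenset("AC"): 0.1, frozenset("AD"): 0.3,
    frozenset("BC"): 0.2, frozenset("BD"): 0.4, frozenset("CD"): 0.0,
}
chosen = select_ensemble(me, pool=["A", "B", "C", "D"], initial=["A", "B"])
print(sorted(chosen))
```

Starting from the pair with the most mutual errors, the loop ends with the pair that has none. The iteration cap plays the role of the user-provided maximal iteration count in Step 6.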

Intuitively, after starting with an arbitrary ensemble, we run
multiple iterations, each time trying to improve the ensemble.
Specifically, we identify the “weakest” member of the current
ensemble, and replace it with a fresh DNN that obtains a
higher uniqueness score relative to the remaining members,
thus ensuring that each change that we make improves the
overall robust accuracy on the fixed set of agreement points.

The greedy search procedure is repeated for the new candidate
ensemble, and so on. The process terminates after a
predefined number of iterations is reached, when the process
converges (no further improvement is achievable on the fixed
set of agreement points), or when a predefined timeout value
is exceeded.

On the Importance of Agreement Points. Our iterative 
scheme for constructing an ensemble starts with an arbi- 
trary selection of k candidate members, and then computes 
the uniqueness score for each member. As mentioned, the 
uniqueness scores are computed with respect to a fixed set of 
agreement points, pre-selected from the validation data (which 
is labeled data, not used for training the DNNs). 

We point out that agreement points are data points on which 
there is overwhelming consensus among ensemble members, 
despite the specific realization of the training process of each 
member. As such, agreement points correspond to data points 
that are “easy” to label correctly. Consequently, data points 
in close proximity of an agreement point are rarely classified 
differently than the agreement point by an individual ensemble 
member, let alone by multiple members simultaneously. As 
our objective is to expose implicit tendencies of ensemble 
members to err together, the close neighborhood of agreement 
points is a natural area for seeking joint deviations from 
the consensual label (which are expected to be extremely 
rare). In our evaluation, we computed uniqueness scores based
solely on correctly-classified agreement points and ignored any
incorrectly-classified agreement points.³

As we later demonstrate, a small set of correctly-classified
agreement points from the validation set can be used, in
practice, to identify ensemble members that tend to err
simultaneously on other data points. We note that this is also the
case even when the chosen agreement points are all identically
labeled.

³For example, in our MNIST experiments 99.7% of the agreement points
were correctly classified by all individual DNNs, and by the ensemble as a
whole.


Monotonicity and Convergence. Using our approach, an 
ensemble member is replaced with a fresh DNN only if 
this replacement leads to strictly fewer joint errors with the 
remaining members on the fixed set of agreement points. 
Thus, the total number of joint errors decreases with every 
replacement; and, as this number is trivially lower-bounded 
by 0, this (“potential-function” style) argument establishes the 
process’s monotonicity and convergence. 
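
The potential-function argument can be illustrated with a small sketch. It assumes the pairwise joint-error counts have already been obtained (for instance from verification queries) and are stored in a hypothetical dictionary `joint_errors`. For brevity the sketch picks the single best swap by total joint errors rather than going through uniqueness scores, so it is a simplification of the procedure and not the authors' implementation.

```python
import time
from itertools import combinations
from typing import Dict, FrozenSet, List


def total_joint_errors(ensemble: List[str],
                       joint_errors: Dict[FrozenSet[str], int]) -> int:
    """Potential function: joint errors summed over all member pairs."""
    return sum(joint_errors[frozenset(pair)]
               for pair in combinations(ensemble, 2))


def greedy_improve(ensemble: List[str],
                   pool: List[str],
                   joint_errors: Dict[FrozenSet[str], int],
                   max_iterations: int = 10,
                   timeout_seconds: float = 60.0) -> List[str]:
    """Swap members one at a time, accepting only strict improvements."""
    current = list(ensemble)
    deadline = time.monotonic() + timeout_seconds
    for _ in range(max_iterations):          # stop: iteration budget
        if time.monotonic() > deadline:      # stop: timeout
            break
        best_total = total_joint_errors(current, joint_errors)
        best_trial = None
        for position in range(len(current)):
            for fresh in pool:
                if fresh in current:
                    continue
                trial = current[:position] + [fresh] + current[position + 1:]
                trial_total = total_joint_errors(trial, joint_errors)
                if trial_total < best_total:  # strictly fewer joint errors
                    best_total, best_trial = trial_total, trial
        if best_trial is None:               # stop: converged
            break
        current = best_trial
    return current
```

Because every accepted swap strictly lowers a non-negative integer, the loop cannot run forever even without the iteration and timeout limits.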

By iteratively reducing the number of joint errors across 
all pairs of chosen ensemble members, our iterative process 
improves the robust accuracy of the resulting ensemble on the 
fixed set of agreement points. This, however, does not guar- 
antee improved robust accuracy over the entire input domain. 
Nonetheless, we show in Section IV that such an improvement 
does typically occur in practice, even on randomly sampled 
subsets of input points (which are not necessarily agreement 
points). 


IV. CASE STUDY: MNIST AND FASHION-MNIST 


Below, we present the evaluation of our methodology 
on two datasets: the MNIST dataset for handwritten digit 
recognition [51], and the Fashion-MNIST dataset for clothing 
classification [91]. Our results for both datasets demonstrate 
that our technique facilitates choosing ensembles that provide 
high robust accuracy via relatively few, efficient verification 
queries. 

The considered datasets are conducive for our purposes 
since they allow attaining high accuracy using fairly small 
DNNs, which enables us to directly quantify the robust accu- 
racy of an entire ensemble, by dispatching verification queries 
that would otherwise be intractable (see Section III-A). This 
provides the ground truth required for assessing the benefits 
of our approach. The scalability afforded by our approach is 
crucial even for handling the relatively modest-sized DNNs 
considered: on the MNIST data, for instance, mutual-error 
verification queries for two ensemble members typically took 
a few seconds, whereas verification queries involving the 
full ensemble of five networks often timed out (35% of the 
queries on the MNIST data timed out after 24 hours, versus 
only roughly 1% of the pairwise mutual-error queries). As 
constituent DNN sizes and ensemble sizes increase, this gap 
in scalability is expected to become even more significant. 

Our verification queries were dispatched using the Marabou 
verification engine [47] (although other engines could also be 
used). Additional details regarding the encoding of the verifi- 
cation queries, as well as detailed experimental results, appear 
in the extended version of this paper [7]. We have publicly 
released our code, as well as all benchmarks and experimental 
data, within an artifact accompanying this paper [6]. 


MNIST. For this part of our evaluation, we trained 10 independent
DNNs $\{N_1, \ldots, N_{10}\}$ over the MNIST dataset [51],
which includes 28×28 grayscale images of 10 handwritten
digits (from “0” to “9”). Each of these networks had the same
architecture: an input layer of 784 neurons, followed by a
fully-connected layer with 30 neurons, a ReLU layer, another
fully-connected layer with 10 neurons, and a final softmax
layer with 10 output neurons, corresponding to the 10 possible
digit labels.⁴ All networks achieved high accuracy rates of
96.29%–96.57% (see Table I).
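
To make the architecture concrete, here is a minimal pure-Python sketch of a forward pass through a network with these layer sizes. The weights are random and untrained, and all function names are illustrative assumptions; the sketch shows only the shapes involved, not the authors' training setup.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def dense(weights: List[List[float]], biases: List[float],
          inputs: List[float]) -> List[float]:
    """Affine transformation: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values: List[float]) -> List[float]:
    return [max(0.0, v) for v in values]


def softmax(values: List[float]) -> List[float]:
    shift = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Random placeholder weights and zero biases (not trained)."""
    weights = [[rng.gauss(0.0, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(params: List[Layer], image: List[float]) -> List[float]:
    (w1, b1), (w2, b2) = params
    hidden = relu(dense(w1, b1, image))    # 784 -> 30, then ReLU
    return softmax(dense(w2, b2, hidden))  # 30 -> 10, then softmax


rng = random.Random(0)
params = [init_layer(30, 784, rng), init_layer(10, 30, rng)]
probabilities = forward(params, [0.0] * 784)  # a blank 28x28 image
parameter_count = sum(len(w) * len(w[0]) + len(b) for w, b in params)
```

With these sizes each network has 784·30 + 30 + 30·10 + 10 trainable parameters, which is small enough to keep verification queries tractable.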

After training, we arbitrarily constructed two distinct ensembles
with five DNN members each: $\mathcal{E}_1 = \{N_1, \ldots, N_5\}$
and $\mathcal{E}_2 = \{N_6, \ldots, N_{10}\}$, with an accuracy of 97.8% and
97.3%, respectively. Notice that the ensembles achieve a
higher accuracy over the test set than their individual members.

We then applied our method in an attempt to improve
the robust accuracy of $\mathcal{E}_1$. We began by searching the
validation set, and identifying 200 agreement points (the set
$A$),⁵ all correctly labeled as “0” by all 10 networks.⁶ Using
the 200 agreement points and 6 different perturbation sizes⁷
$\epsilon \in \{0.01, 0.02, 0.03, 0.04, 0.05, 0.06\}$, we constructed 1200
$\epsilon$-balls around the selected agreement points; and then, for
every ball $B$ and for every pair $N_i, N_j \in \mathcal{E}_1$, we encoded
a verification query to check whether $N_i$ and $N_j$ have a
mutual error in $B$ (see example in Fig. 3). This resulted in
$\binom{5}{2} \cdot 200 \cdot 6 = 12000$ verification queries, which we dispatched
using the Marabou DNN verifier [47] (each query ran with a
2-hour timeout limit). Finally, we used the results to compute
the uniqueness score for each network in $\mathcal{E}_1$; these results,
which appear briefly in Table I (for $\epsilon = 0.02$) and appear in
full in [7], clearly show that two of the members, $N_2$ and
$N_5$, are each relatively prone to erring simultaneously with
the remaining four members of $\mathcal{E}_1$.
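
The query count above, and the way per-pair results might be folded into a per-member score, can be sketched as follows. The uniqueness-score formula used here (one minus the average mutual-error rate with the other members) is our assumption about the definition given earlier in the paper, and the names are illustrative only.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, List

members = ["N1", "N2", "N3", "N4", "N5"]
agreement_points = range(200)  # indices standing in for the 200 points
epsilons = [0.01, 0.02, 0.03, 0.04, 0.05, 0.06]

# One epsilon-ball per (agreement point, perturbation size) combination.
balls = list(product(agreement_points, epsilons))

# One mutual-error query per (pair of members, ball) combination.
queries = [(a, b, point, eps)
           for (a, b) in combinations(members, 2)
           for (point, eps) in balls]


def uniqueness_score(member: str,
                     ensemble: List[str],
                     mutual_error: Dict[FrozenSet[str], float]) -> float:
    """ASSUMED form: 1 - average mutual-error rate with the other members.

    mutual_error maps a pair of members to the fraction of balls in
    which both members err (a value between 0 and 1).
    """
    others = [other for other in ensemble if other != member]
    average = sum(mutual_error[frozenset((member, other))]
                  for other in others) / len(others)
    return 1.0 - average
```

Under this reading, a member that rarely errs together with its peers receives a score close to 1.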

Next, we began searching among the remaining networks,
$N_6, \ldots, N_{10}$, for good replacements for $N_2$ and $N_5$.
Specifically, we searched for networks that obtained higher US scores
than $N_2$ and $N_5$. To achieve this, we began modifying $\mathcal{E}_1$, each
time removing either $N_2$ or $N_5$, replacing it with one of the
remaining networks, and computing the uniqueness scores for
the new members (with respect to the four remaining original
networks). We observed that for both $N_2$ and $N_5$, network $N_9$
was a good replacement, obtaining very high US values. For
additional details, see the extended version of our paper [7].

Finally, to evaluate the effect of our changes to
$\mathcal{E}_1$, we constructed the two new ensembles,
$\mathcal{E}_1^{2 \to 9} = \{N_1, N_9, N_3, N_4, N_5\}$ and
$\mathcal{E}_1^{5 \to 9} = \{N_1, N_2, N_3, N_4, N_9\}$.
Computing the new ensembles’ robust accuracy over the entire


⁴ Although the DNNs all have the same size and architecture, common
ensemble training processes randomly initialize their weights, and also
randomly pick samples from the same training set (see [50]). This is the cause
for diversity among ensemble members, which our algorithm later detects.

⁵ In our experiments, we empirically selected 200 agreement points in order
to balance between precision (a higher number of points) and verification
speed (a smaller number of points). This selection is based on a user’s
available computing power.

⁶ The “0” label is the label with the highest accuracy among the trained
ensemble members, and thus “0”-labeled agreement points represent areas in
the input space with extremely high consensus.

⁷ $\epsilon$ values which are too small, or too large, render the queries trivial. Thus,
we found it to be useful to use a varied selection of $\epsilon$ values.


TABLE I: Accuracy and uniqueness scores for the MNIST networks. Uniqueness
scores are measured with respect to the ensemble (either $\mathcal{E}_1$ or $\mathcal{E}_2$).

Ensemble    $\mathcal{E}_1$                              $\mathcal{E}_2$
Network     $N_1$    $N_2$    $N_3$    $N_4$    $N_5$    $N_6$    $N_7$    $N_8$    $N_9$    $N_{10}$
Accuracy    96.42%   96.55%   96.40%   96.46%   96.29%   96.44%   96.48%   96.57%   96.51%   96.46%
US          90.75%   88.38%   90.63%   92.13%   88.63%   97.38%   96.75%   97.5%    98.88%   97.75%

Fig. 3: Checking whether two MNIST digit recognition networks
have a mutual error around an agreement point labeled
“9”. In this case, the same perturbation causes one network to
output the incorrect label “2”, and the other network to output
the incorrect label “7”.


test set is computationally expensive, and thus we sampled 200 
random points from the test set (these did not necessarily have 
the same label, nor were they required to be agreement points 
for the ensemble). For each sample, we created a verification 
query to check the robust accuracy of the new ensembles 
around the point, compared to the original ensemble. The 
results are plotted in Fig. 4, and indicate that the new ensem- 
bles demonstrated significantly higher robust accuracy on the 
tested points. These results validate our claim that a scoring 
metric based on agreement points is useful in improving the 
ensemble’s robustness also on other, “harder”, input points. 
Our analysis also indicates that the improved robustness results
originated not only from $\epsilon$-balls around inputs labeled as “0”,
but from other labels as well. In fact, the gain in robustness
was not just in quantity, but also in quality: for almost all cases,
whenever $\mathcal{E}_1$ proved robust around an input, so did $\mathcal{E}_1^{2 \to 9}$ and
$\mathcal{E}_1^{5 \to 9}$. This indicates that the improved robustness originated
from inputs on which $\mathcal{E}_1$ was prone to err.

Next, we turned our attention to $\mathcal{E}_2$, and computed the
uniqueness scores for each of its members (see Table I). This
time we conducted a “reverse” experiment: we identified the
two best members of $\mathcal{E}_2$, i.e., the two networks that had the
highest uniqueness scores, and were consequently the least
prone to err simultaneously. These turned out to be networks
$N_9$ and $N_{10}$. Next, we replaced each of these networks with
each of the networks $\{N_1, \ldots, N_5\}$, in order to identify a
network that, when inserted into $\mathcal{E}_2$, achieved a lower score than
$N_9$ and $N_{10}$. $N_4$ turned out to be such a network. We created
the two modified ensembles, $\mathcal{E}_2^{9 \to 4} = \{N_6, N_7, N_8, N_4, N_{10}\}$
and $\mathcal{E}_2^{10 \to 4} = \{N_6, N_7, N_8, N_9, N_4\}$, and compared their
robust accuracy to that of $\mathcal{E}_2$ on 200 random points from
the test set. The results, depicted in Fig. 4, indicate that
the ensemble’s robust accuracy decreased significantly, as
expected.

In both aforementioned experiments, we also computed 
the accuracy (as opposed to robust accuracy) of the new 
ensembles, by evaluating them over the test set. All new 
ensembles had an accuracy that was on par with that of the 
original ensembles; specifically, within a range of ±0.2%
from the original ensembles’ accuracy.


Fashion-MNIST. For the second part of our evaluation,
we trained 10 independent DNNs $\{N_{11}, \ldots, N_{20}\}$ over the
Fashion-MNIST dataset [91], which includes 28×28 grayscale
images of 10 clothing categories (“Coat”, “Dress”, etc.),
and is considered more complex than the MNIST dataset.
Each DNN had the same architecture as the MNIST-trained
DNNs, and achieved an accuracy of 87.05%–87.53% (see
Table II). We arbitrarily constructed two distinct ensembles,
$\mathcal{E}_3 = \{N_{11}, \ldots, N_{15}\}$ and $\mathcal{E}_4 = \{N_{16}, \ldots, N_{20}\}$, with an
accuracy of 88.22% and 88.48%, respectively.

Next, we again computed the US values of each of the
networks. The results, which appear in full in [7], indicate a
high variance among the uniqueness scores of the members
of $\mathcal{E}_4$, as compared to the relatively similar scores of $\mathcal{E}_3$’s
members. We thus chose to focus on $\mathcal{E}_4$. Based on the
computed US values, we identified $N_{20}$ as its least unique
DNN; and, by replacing $N_{20}$ with each of the five networks
not currently in $\mathcal{E}_4$, identified that $N_{15}$ is a good candidate
for replacing $N_{20}$. Performing our validation step over $\mathcal{E}_4^{20 \to 15}$
revealed that its robust accuracy has indeed increased. Running
the “reverse” experiment, in which $\mathcal{E}_4$’s most unique member
is replaced with a worse candidate, led us to consider the
ensemble $\mathcal{E}_4^{18 \to 13}$, which indeed demonstrated lower robust
accuracy than the original ensemble. For additional details,
see the extended version of our paper [7].

For the final step of our experiment, we used our approach
to iteratively switch two members of an ensemble. Specifically,
after creating $\mathcal{E}_4^{20 \to 15}$, which had higher robust accuracy than
$\mathcal{E}_4$, we re-computed the US scores of its members, and
identified again the least unique member (in this case, $N_{16}$).
Per our computation, the best candidate for replacing it was
$N_{12}$. The resulting ensemble, namely $\mathcal{E}_4^{20 \to 15, 16 \to 12}$, indeed
demonstrated higher robust accuracy than both its predecessors.
Performing another iteration of the “reverse” experiment
yielded ensemble $\mathcal{E}_4^{18 \to 13, 17 \to 11}$ with poorer robust accuracy.
The results appear in Fig. 5. We note that the only discrepancy,
namely the robust accuracy of $\mathcal{E}_4^{20 \to 15}$ being lower than that
Fig. 4: The average robust accuracy scores for our original and modified ensembles. The results for $\epsilon = 0.01$ and $\epsilon = 0.06$ are
trivial (the ensembles achieve near-perfect or near-zero robustness), and are omitted to reduce clutter.


TABLE II: Accuracy and uniqueness scores for the Fashion-MNIST networks. Uniqueness
scores are measured with respect to the ensemble (either $\mathcal{E}_3$ or $\mathcal{E}_4$).

Ensemble    $\mathcal{E}_3$                                   $\mathcal{E}_4$
Network     $N_{11}$  $N_{12}$  $N_{13}$  $N_{14}$  $N_{15}$  $N_{16}$  $N_{17}$  $N_{18}$  $N_{19}$  $N_{20}$
Accuracy    87.14%    87.13%    87.53%    87.34%    87.3%     87.05%    87.32%    87.35%    87.34%    87.11%
US          70.63%    71.5%     69.75%    70.88%    73.25%    67.38%    72.38%    80.13%    71.38%    66.75%

Fig. 5: The original ensemble $\mathcal{E}_4$ (center), ensembles modified
to gain robust accuracy (right), and ensembles modified to
reduce robust accuracy (left).


of $\mathcal{E}_4$ for $\epsilon = 0.04$, is due to timeouts.

Similarly to the MNIST case, the new ensembles in the
Fashion-MNIST experiments obtained an accuracy that was on
par with that of the original ensembles; specifically, within
a range of ±0.17% from the original ensemble’s accuracy.


V. COMPARISON TO GRADIENT-BASED ATTACKS 


Current state-of-the-art approaches for assessing a network’s 
robustness and robust accuracy rely on gradient-based attacks 
— a popular class of algorithms that, like verification methods, 
are capable of finding adversarial examples for a given neural 
network. In this section we compare our verification-based 
approach to these methods. 

Gradient-based attacks generate adversarial examples by 
optimizing (via various techniques) a loss metric over the 
network’s output, relative to its input. This allows these 
methods to effectively search the local surroundings of a 
fixed input point for local optima, which often constitute
adversarial inputs. Gradient-based methods, such as the fast- 
gradient sign method (FGSM) [39], projected gradient descent 
(PGD) [60], and others [49], [59], are in widespread use due 
to their scalability and relative ease of use. However, as we 
demonstrate here, they are often unsuitable in our setting. 
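
For readers unfamiliar with this family of attacks, the following is a minimal sketch of the standard single-model FGSM step on a toy linear classifier with logistic loss. It is not one of the three ensemble attacks (G.A. 1, 2 and 3) described in [7]; the model, the function names, and the numeric values are illustrative assumptions.

```python
import math
from typing import List


def sign(value: float) -> int:
    return (value > 0) - (value < 0)


def margin(w: List[float], b: float, x: List[float], y: int) -> float:
    """Signed margin y * (w.x + b); positive means x is classified as y."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient(w: List[float], b: float,
                  x: List[float], y: int) -> List[float]:
    """Gradient with respect to x of the logistic loss log(1 + exp(-margin))."""
    scale = -y / (1.0 + math.exp(margin(w, b, x, y)))
    return [scale * wi for wi in w]


def fgsm(w: List[float], b: float, x: List[float], y: int,
         epsilon: float) -> List[float]:
    """One FGSM step: move each coordinate by epsilon along the gradient sign."""
    gradient = loss_gradient(w, b, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, gradient)]


w, b = [2.0, -1.0], 0.0          # toy linear model (assumed values)
x, y = [0.3, 0.1], 1             # correctly classified input, label +1
x_adv = fgsm(w, b, x, y, epsilon=0.2)
```

The perturbed point stays inside the L-infinity ball of radius 0.2 around `x`, yet its margin becomes negative. An attack of this kind can only report adversarial inputs it happens to find; failing to find one proves nothing about the ball.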

In order to evaluate the effectiveness of gradient-based 
methods for measuring the robust accuracy of ensembles, we 
modified the common FGSM [39] and I-FGSM [49] (“Iterative 
FGSM”) methods, thus extending them into three novel attacks 
aimed at finding adversarial examples that can fool multiple 
ensemble members simultaneously. We refer to these attacks as 
Gradient Attack (G.A.) 1, 2, and 3. For a thorough explanation 
of these attacks, as well as information about their design and 
implementation, see the extended version of our paper [7]. 

Next, we used our three attacks to search for mutual errors
of DNN pairs, i.e., adversarial examples that simultaneously
affect a pair of DNNs. Specifically, we applied the attacks on
both datasets (MNIST and Fashion-MNIST), and searched for
adversarial examples within various $\epsilon$-balls around the same
set of agreement points used in our previous experiments.
This allowed us to subsequently compute, via gradient attacks,
the mutual error scores of DNN pairs, and consequently,
the uniqueness scores of each constituent ensemble member.
The total numbers of adversarial inputs found
(SAT queries) are summarized in Table III. Each gradient
attack typically took a few seconds to run. We also provide
further details regarding the uniqueness scores computed by
the three gradient-based methods in the extended version of
this paper [7], and in our accompanying artifact [6].

The results in Table III include a total of 108000 experiments,
on all ensemble pairs.⁸ In these experiments, our

⁸ The 108000 experiments consist of $\binom{10}{2}$ pairs, times 200 agreement
points, times 6 perturbation sizes, times 2 datasets.


TABLE III: The number of SAT queries discovered when
searching for an adversarial attack, using the three gradient
attack methods (G.A. 1, 2 and 3), and our verification approach.

Experiment       G.A. 1    G.A. 2    G.A. 3    Verification
MNIST             1,333     3,886     5,574          16,826
Fashion-MNIST    17,190    21,245    22,129          33,152
Total            18,523    25,131    27,703          49,978


verification-based approach returned 49978 SAT results, while 
the strongest gradient-based method (gradient attack number 
3) returned only 27703 SAT results — a 44% decrease in 
the number of counterexamples found. This discrepancy is on 
par with previous research [89], which indicates that gradient- 
based methods may err significantly when used for adversarial 
robustness analysis. This phenomenon manifests strongly in 
our setting, which involves many small and medium-sized per- 
turbations that gradient-based approaches struggle with [24]. 
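
The figures quoted here can be re-derived from Table III with a few lines of arithmetic. The dictionary layout and variable names below are our own, and we assume the pair count covers all 10 networks of each dataset.

```python
from math import comb

# SAT counts copied from Table III.
sat_counts = {
    "G.A. 1": {"MNIST": 1333, "Fashion-MNIST": 17190},
    "G.A. 2": {"MNIST": 3886, "Fashion-MNIST": 21245},
    "G.A. 3": {"MNIST": 5574, "Fashion-MNIST": 22129},
    "verification": {"MNIST": 16826, "Fashion-MNIST": 33152},
}

totals = {method: sum(per_dataset.values())
          for method, per_dataset in sat_counts.items()}

# Pairs among 10 networks, times 200 points, 6 perturbation sizes, 2 datasets.
experiments = comb(10, 2) * 200 * 6 * 2

# Relative drop in counterexamples found by the strongest gradient attack.
decrease = 1 - totals["G.A. 3"] / totals["verification"]
print(experiments, totals, f"{100 * decrease:.1f}%")
```

The computed drop falls between 44% and 45%, in line with the figure quoted above.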

The reduced precision afforded by gradient-based ap- 
proaches can, in some cases, lead to sub-optimal ensemble 
selection choices when compared to our verification-based 
approaches. Specifically, even if a gradient-based approach 
produces a uniqueness score ranking that coincides with the 
one produced using verification, the dramatic decrease in the 
number of SAT queries leads to much smaller mutual error 
scores, and consequently — to uniqueness score values that are 
overly optimistic, and less capable of distinguishing between 
poor and superior robust accuracy results. 

For example, when observing the first two arbitrary ensembles
on the MNIST dataset, $\mathcal{E}_1$ and $\mathcal{E}_2$, the three gradient
approaches (G.A. 1, 2 and 3) respectively assign average
uniqueness scores of (95.4%, 97.8%), (87.5%, 94.5%) and
(83.1%, 92.5%) to the two ensembles (when averaging the
US over all ensemble members and all perturbations). This
indicates that the robust accuracy of the two ensembles is
fairly similar (see appendices in [7]). In contrast, when using
the more sensitive, verification-based approach, we find a
substantially higher number of mutual errors (see Table III),
and consequently, detect a much larger gap between the
uniqueness scores of the two ensembles: 55% and 77%.

Another example that demonstrates the increased sensitivity
of our method, when compared to gradient-based approaches,
is obtained by observing the average uniqueness score of
$\mathcal{E}_3$ and $\mathcal{E}_4$ on the Fashion-MNIST dataset. The strongest
gradient attack that we used assigned almost identical average
uniqueness scores to both ensembles (up to a difference of
0.01%), while our approach was sensitive enough to find a
2% difference between the average US of the two ensembles.

Finally, we note that, unlike verification-based approaches,
gradient attacks are incomplete, and are consequently unable
to return UNSAT. This makes them less suitable for assessing
any additional uniqueness metrics based on robust $\epsilon$-balls. We
thus argue that, although gradient-based methods are faster
and more scalable than verification, our results showcase the
benefits of using verification-based approaches for assessing
uniqueness scores and for ensemble selection.


VI. RELATED WORK 


Due to its pervasiveness, the phenomenon of adversarial 
inputs has received a significant amount of attention [27], 
[34], [61], [65], [66], [80], [99]. More specifically, the ma- 
chine learning community has put a great deal of effort into 
measuring and improving the robustness of networks [18]- 
[20], [29], [36], [54], [60], [68], [71], [72], [87], [94]. The 
formal methods community has also been looking into the 
problem, by devising scalable DNN verification, optimization 
and monitoring techniques [1], [5], [8], [10]-[12], [16], [26], 
[41], [42], [55], [56], [64], [67], [70], [76], [90], [96]. To the 
best of our knowledge, ours is the first attempt to apply DNN 
verification to the setting of DNN ensembles. We note that our 
approach uses a DNN verifier strictly as a black-box backend, 
and so its scalability will improve as DNN verifiers become 
more scalable. 

Obtaining DNN specifications to be verified is a difficult 
problem. While some studies have successfully applied verifi- 
cation to properties formulated by domain-specific experts [3], 
[4], [22], [25], [45], [78], most research has been focusing on 
universal properties, which pertain to every DNN-based sys- 
tem; specifically, local adversarial robustness [17], [35], [58], 
[76], fairness properties [83], network simplification [31] and 
modification [23], [32], [69], [77], [84], [93], and watermark 
resilience [32]. 


VII. CONCLUSION AND FUTURE WORK 


In this case-study paper, we demonstrate a novel technique 
for assessing a deep ensemble’s robust accuracy through the 
use of DNN verification. To mitigate the difficulty inherent 
to verifying large ensembles, our approach considers pairs of 
networks, and computes for each ensemble member a score 
that indicates its tendency to make the same errors as other en- 
semble members. These scores allow us to iteratively improve 
the robust accuracy of the ensemble, by replacing weaker 
networks with stronger ones. Our empirical evaluation indicates
the high practical potential of our approach; and, more broadly,
we view this work as a part of the ongoing endeavor to
demonstrate the real-world usefulness of DNN verification,
by identifying additional, universal, DNN specifications.

Moving forward, we plan to tackle the natural open questions
raised by our work; specifically, how our methodology
for selecting robustly accurate ensembles can be extended
beyond the current greedy search heuristic, as well as how
ensembles should be selected in the context of other
performance objectives, beyond robust accuracy. We also plan
to experiment with multiple stopping conditions for the
ensemble member replacement process, and to explore
potential synergies between our verification-based approach
and gradient-based approaches for computing mutual error
scores. In addition, we note that we are currently extending
our approach to regression learning ensembles and deep
reinforcement learning ensembles. Finally, we are in the process of
optimizing our approach by using lighter-weight, incomplete
verification tools (e.g., [76], [88], [95]), which afford better
scalability, and also support parallelization. This will
hopefully allow us to handle significantly larger DNNs and more
complex datasets.
Abstract—Deep neural networks (DNNs) are increasingly being
employed in safety-critical systems, and there is an urgent need to 
guarantee their correctness. Consequently, the verification com- 
munity has devised multiple techniques and tools for verifying 
DNNs. When DNN verifiers discover an input that triggers an 
error, that is easy to confirm; but when they report that no 
error exists, there is no way to ensure that the verification 
tool itself is not flawed. As multiple errors have already been 
observed in DNN verification tools, this calls the applicability 
of DNN verification into question. In this work, we present a 
novel mechanism for enhancing Simplex-based DNN verifiers 
with proof production capabilities: the generation of an easy-to- 
check witness of unsatisfiability, which attests to the absence of 
errors. Our proof production is based on an efficient adaptation 
of the well-known Farkas’ lemma, combined with mechanisms 
for handling piecewise-linear functions and numerical precision 
errors. As a proof of concept, we implemented our technique on 
top of the Marabou DNN verifier. Our evaluation on a safety- 
critical system for airborne collision avoidance shows that proof 
production succeeds in almost all cases and requires only minimal 
overhead. 


I. INTRODUCTION 


Machine learning techniques, and specifically deep neural 
networks (DNNs), have been achieving groundbreaking re- 
sults in solving computationally difficult problems. Nowadays, 
DNNs are state-of-the-art tools for performing many safety- 
critical tasks in the domains of healthcare [29], aviation [45] 
and autonomous driving [19]. DNN training is performed by 
adjusting the parameters of a DNN to mimic a highly complex 
function over a large set of input-output examples (the training 
set) in an automated way that is mostly opaque to humans. 

The Achilles heel of DNNs typically lies in generalizing 
their predictions from the finite training set to an infinite input 
domain. First, DNNs tend to produce unexpected results on 
inputs that are considerably different from those in the training 
set; and second, the input to the DNN might be perturbed 
by sensorial imperfections, or even by a malicious adversary, 
again resulting in unexpected and erroneous results. These 
weaknesses have already been observed in many modern 
DNNs [37], [64], and have even been demonstrated in the 
real world [30] — thus hindering the adoption of DNNs in 
safety-critical settings. 

In order to bridge this gap, in recent years, the formal 
methods community has started devising techniques for DNN 
verification (e.g., [2], [11], [13], [31], [32], [40], [41], [53], 
[58], [61], [62], [66], [68], [73], among many others). Typi- 
cally, DNN verification tools seek to prove that outputs from a 
given set of inputs are contained within a safe subspace of the 
output space, using various methods such as SMT solving [1],
[16], [23], abstract interpretation [32], MILP solving [65], and 
combinations thereof. Notably, many modern approaches [50], 
[53], [55], [65] involve a search procedure, in which the 
verification problem is regarded as a set of constraints. Then, 
various input assignments to the DNN are considered in order 
to discover a counter-example that satisfies these constraints, 
or to prove that no such counter-example exists. 

Verification tools are known to be as prone to errors as 
any other program [44], [72]. Moreover, the search procedures 
applied as part of DNN verification typically involve the 
repeated manipulation of a large number of floating-point 
equations; this can lead to rounding errors and numerical 
stability issues, which in turn could potentially compromise 
the verifier’s soundness [12], [44]. When the verifier discovers 
a counter-example, this issue is perhaps less crucial, as the 
counter-example can be checked by evaluating the DNN; but 
when the verifier determines that no counter-example exists, 
this conclusion is typically not accompanied by a witness of 
its correctness. 

In this work, we present a novel proof-production mech- 
anism for a broad family of search-based DNN verification 
algorithms. Whenever the search procedure returns UNSAT 
(indicating that no counter-example exists), our mechanism 
produces a proof certificate that can be readily checked using 
simple, external checkers. The proof certificate is produced 
using a constructive version of Farkas’ lemma, which guaran- 
tees the existence of a witness to the unsatisfiability of a set 
of linear equations — combined with additional constructs 
to support the non-linear components of a DNN, i.e., its 
piecewise-linear activation functions. We show how to instru- 
ment the verification algorithm in order to keep track of its 
search steps, and use that information to construct the proof 
with only a small overhead. 
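
To give a flavor of what an easy-to-check witness looks like, here is a minimal sketch of a checker for a Farkas-style certificate over a bounded linear system. It uses a textbook formulation (a row combination whose attainable range inside the bounds excludes the required value), which may differ in detail from the variant developed in the following sections; the function name and the example system are hypothetical. Exact rational arithmetic is used so that the check itself does not suffer from rounding errors.

```python
from fractions import Fraction
from typing import Sequence


def check_farkas_witness(A: Sequence[Sequence[float]],
                         b: Sequence[float],
                         lower: Sequence[float],
                         upper: Sequence[float],
                         w: Sequence[float]) -> bool:
    """Return True if w certifies that no x with lower <= x <= upper
    satisfies A x = b (all bounds are assumed finite in this sketch)."""
    rows, cols = len(A), len(lower)
    lo = [Fraction(v) for v in lower]
    hi = [Fraction(v) for v in upper]
    # Any solution of A x = b must also satisfy (w^T A) x = w^T b.
    combined = [sum(Fraction(w[i]) * Fraction(A[i][j]) for i in range(rows))
                for j in range(cols)]
    target = sum(Fraction(w[i]) * Fraction(b[i]) for i in range(rows))
    # Smallest and largest values (w^T A) x can take inside the bounds.
    smallest = sum(c * (lo[j] if c >= 0 else hi[j])
                   for j, c in enumerate(combined))
    largest = sum(c * (hi[j] if c >= 0 else lo[j])
                  for j, c in enumerate(combined))
    # If the target lies outside that range, the system is infeasible.
    return target < smallest or target > largest


# x1 - x2 = 0 with x1 in [2, 3] and x2 in [0, 1] has no solution.
is_certified = check_farkas_witness(
    A=[[1, -1]], b=[0], lower=[2, 0], upper=[3, 1], w=[1])
```

Checking such a witness takes a single pass over the matrix, far less work than the search that produced it.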

For evaluation purposes, we implemented our proof- 
production technique on top of the Marabou DNN verifier [50]. 
We then evaluated our technique on the ACAS Xu set of 
benchmarks for airborne collision avoidance [46], [48]. Our 
approach was able to produce proof certificates for the safety 
of various ACAS Xu properties with reasonable overhead 
(5.7% on average). Checking the proof certificates produced 
by our approach was usually considerably faster than dispatch- 
ing the original verification query. 

The main contribution of our paper is in proposing a
proof-production mechanism for search-based DNN verifiers,
which can substantially increase their reliability when
determining unsatisfiability. However, it also lays a foundation
for a conflict-driven clause learning (CDCL) [74] verification
scheme for DNNs, which might significantly improve the
performance of search-based procedures (see discussion in
Sec. IX).

The rest of this paper is organized as follows. In Sec. II 
we provide relevant background on DNNs, formal verification, 
the Simplex algorithm, and on using Simplex for search-based 
DNN verification. In Sec. III, IV and V, we describe the proof- 
production mechanism for Simplex and its extension to DNN 
verification. Next, in Sec. VI, we briefly discuss complexity- 
theoretical aspects of the proof production. Sec. VII details our 
implementation of the technique and its evaluation. We then 
discuss related work in Sec. VIII and conclude with Sec. IX. 


II. BACKGROUND 


Deep Neural Networks. Deep neural networks (DNNs) [36] 
are directed graphs, whose nodes (neurons) are organized into 
layers. Nodes in the first layer, called the input layer, are 
assigned values based on the input to the DNN; and then 
the values of nodes in each of the subsequent layers are 
computed as functions of the values assigned to neurons in 
the preceding layer. More specifically, each node value is 
computed by first applying an affine transformation to the 
values from the preceding layer and then applying a non-linear 
activation function to the result. The final (output) layer, which 
corresponds to the output of the network, is computed without 
applying an activation function. 

One of the most common activation functions is the rectified 
linear unit (ReLU), which is defined as: 


$$f(b) = \mathrm{ReLU}(b) = \begin{cases} b & b > 0 \\ 0 & \text{otherwise.} \end{cases}$$


When b > 0, we say that the ReLU is in the active phase; 
otherwise, we say it is in the inactive phase. For simplicity, 
we restrict our attention here to ReLUs, although our approach 
could be applied to other piecewise-linear functions (such as 
max pooling, absolute value, sign, etc.). Non-piecewise-linear
functions, such as sigmoid or tanh, are left for future work.

Formally, a DNN $\mathcal{N} : \mathbb{R}^m \to \mathbb{R}^k$ is a sequence of $n$ layers
$L_0, \ldots, L_{n-1}$, where each layer $L_i$ consists of $s_i \in \mathbb{N}$ nodes,
denoted $v_i^1, \ldots, v_i^{s_i}$. The assignment for the $j$-th node in
layer $1 \leq i < n-1$ is computed as

$$v_i^j = \mathrm{ReLU}\left( \sum_{l=1}^{s_{i-1}} w_{i,j,l} \cdot v_{i-1}^l + p_i^j \right)$$

and neurons in the output layer are computed as:

$$v_{n-1}^j = \sum_{l=1}^{s_{n-2}} w_{n-1,j,l} \cdot v_{n-2}^l + p_{n-1}^j$$

where $w_{i,j,l}$ and $p_i^j$ are (respectively) the predetermined
weights and biases of $\mathcal{N}$. We set $s_0 = m$ and treat $v_0^1, \ldots, v_0^m$
as the input of $\mathcal{N}$.

A simple DNN with four layers appears in Fig. 1. For
simplicity, the $p_i^j$ parameters are all set to zero and are ignored.


Fig. 1: A toy DNN.


For input $(1, 2)$, the node in the second layer evaluates to
$\mathrm{ReLU}(1 \cdot 1 + 2 \cdot (-1)) = \mathrm{ReLU}(-1) = 0$; the node in the
third layer evaluates to $\mathrm{ReLU}(0 \cdot (-2)) = 0$; and the node in
the fourth (output) layer evaluates to $0 \cdot 1 = 0$.
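
The same evaluation can be reproduced in a few lines of code. Since the figure itself is not reproduced here, the weights below are read off the arithmetic in the worked example (1 and -1 into the second layer, -2 into the third, 1 into the output) and should be treated as an assumption; `toy_dnn` is a hypothetical name.

```python
def relu(b: float) -> float:
    """Rectified linear unit: b when positive, 0 otherwise."""
    return b if b > 0 else 0.0


def toy_dnn(x1: float, x2: float) -> float:
    # Weights inferred from the worked example; all biases are zero.
    second_layer = relu(1 * x1 + (-1) * x2)
    third_layer = relu(-2 * second_layer)
    return 1 * third_layer  # the output layer applies no activation


result = toy_dnn(1, 2)
```

Because the second-layer value is never negative and is then multiplied by -2 before another ReLU, this particular network outputs 0 on every input.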


DNN Verification and Proofs. Given a DNN $\mathcal{N} : \mathbb{R}^m \to \mathbb{R}^k$
and a property $P : \mathbb{R}^{m+k} \to \{T, F\}$, the DNN verification
problem is to decide whether there exist $x \in \mathbb{R}^m$ and $y \in \mathbb{R}^k$
such that $(\mathcal{N}(x) = y) \wedge P(x, y)$ holds. If such $x$ and $y$ exist, we
say that the verification query $\langle \mathcal{N}, P \rangle$ is satisfiable (SAT); and
otherwise, we say that it is unsatisfiable (UNSAT). For example,
given the toy DNN from Fig. 1, we can define a property
$P$: $P(x, y) \Leftrightarrow (x \in [2,3] \times [-1,1]) \wedge (y \in [0.25, 0.5])$. Here,
$P$ expresses the existence of an input $x \in [2,3] \times [-1,1]$ that
produces an output $y \in [0.25, 0.5]$. Later on, we will prove
that no such $x$ exists, i.e., the verification query $\langle \mathcal{N}, P \rangle$ is
UNSAT.
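
For this particular toy query, unsatisfiability can already be seen with simple interval arithmetic, as the sketch below illustrates. This is interval bound propagation, a different and generally incomplete technique from the Simplex-based search discussed in this paper; the weights are again those inferred from the worked example for Fig. 1, and all names are illustrative.

```python
from typing import Tuple

Interval = Tuple[float, float]


def relu_interval(lo: float, hi: float) -> Interval:
    """ReLU is monotone, so it can be applied to both endpoints."""
    return max(0.0, lo), max(0.0, hi)


def scale_interval(weight: float, lo: float, hi: float) -> Interval:
    """Multiply an interval by a constant (a negative weight flips it)."""
    a, b = weight * lo, weight * hi
    return min(a, b), max(a, b)


def output_bounds(x1: Interval, x2: Interval) -> Interval:
    # Second layer: ReLU(1 * x1 + (-1) * x2), weights assumed as above.
    lo, hi = relu_interval(x1[0] - x2[1], x1[1] - x2[0])
    # Third layer: ReLU(-2 * v).
    lo, hi = relu_interval(*scale_interval(-2.0, lo, hi))
    # Output layer: 1 * v, no activation.
    return scale_interval(1.0, lo, hi)


lo, hi = output_bounds((2.0, 3.0), (-1.0, 1.0))
# The computed bounds over-approximate all reachable outputs, so if they
# are disjoint from [0.25, 0.5], no input can satisfy the property.
query_is_unsat = hi < 0.25 or lo > 0.5
```

Here the output interval collapses to a single point outside [0.25, 0.5], so the query has no solution. On larger networks such bounds are usually too loose to decide a query on their own, which is where search-based procedures come in.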

Typically, P represents the negation of a desired property, 
and so an input x which satisfies the query is a counter- 
example — whereas the query’s unsatisfiability indicates that 
the property holds. In this work, we follow mainstream DNN 
verification research [53], [68] and focus on properties P that 
are a conjunction of linear lower- and upper-bound constraints 
on the neurons of x and y. It has been shown that even 
for such simple properties, and for DNNs that use only the 
ReLU activation function, the verification problem is NP- 
complete [48]. 

A proof is a mathematical object that certifies a mathemat- 
ical statement. In case a DNN verification query is SAT, the 
input x for which P holds constitutes a proof of the query’s 
satisfiability. Our goal here is to generate proofs also for the 
UNSAT case, which, to the best of our knowledge, is a feature 
that no DNN verifier currently supports [12]. 


Verifying DNNs via Linear Programming. Linear Programming
(LP) [22] is the problem of optimizing a linear function
over a given convex polytope. An LP instance over variables
$V = [x_1, \ldots, x_n]^T \in \mathbb{R}^n$ contains an objective function $c \cdot V$
to be maximized, subject to the constraints $A \cdot V = b$ for
some $A \in M_{m \times n}(\mathbb{R})$, $b \in \mathbb{R}^m$, and $l \le V \le u$ for some
$l, u \in (\mathbb{R} \cup \{\pm\infty\})^n$. Throughout the paper, we use $l(x_i)$ and
$u(x_i)$ to refer to the lower and upper bounds (respectively)
of $x_i$. LP solving can also be used to check the satisfiability
of constraints of the form $(A \cdot V = b) \wedge (l \le V \le u)$.

The Simplex algorithm [22] is a widely used technique 
for solving LP instances. It begins by creating a tableau, 
which is equivalent to the original set of equations AV = b. 
Next, Simplex selects a certain subset of the variables,
$B \subseteq \{x_1, \ldots, x_n\}$, to act as the basic variables; and the tableau
is considered as representing each basic variable $x_i \in B$ as
a linear combination of non-basic variables, $x_i = \sum_{x_j \notin B} c_j \cdot x_j$.
We use $A_{i,j}$ to denote the coefficient of a variable $x_j$ in the
tableau row that corresponds to basic variable $x_i$. Apart from
the tableau, Simplex also maintains a variable assignment that
satisfies the equations of $A$, but which may temporarily violate
the bound constraints $l \le V \le u$. The assignment for a
variable $x_i$ is denoted $\alpha(x_i)$.


After initialization, Simplex begins searching for an as- 
signment that simultaneously satisfies both the tableau and 
bound constraints. This is done by manipulating the set B, 
each time swapping a basic and a non-basic variable. This 
alters the equations of A by adding multiples of equations 
to other equations, and allows the algorithm to explore new 
assignments. The algorithm can terminate with a SAT answer 
when a satisfying assignment is discovered or an UNSAT 
answer when: (i) a variable has contradicting bounds, i.e.,
$l(x_i) > u(x_i)$; or (ii) one of the tableau equations
$x_i = \sum_{x_j \notin B} c_j \cdot x_j$ implies that $x_i$ can never satisfy its bounds. The
algorithm is sound, and is also complete if certain
heuristics are used for selecting the manipulations of $B$ [22].
A detailed calculus for the version of Simplex that we use 
appears in the extended version of this paper [42]. 


LP solving is particularly useful in the context of DNN 
verification, and is used by almost all modern tools (either na- 
tively [48], or by invoking external solvers such as GLPK [54] 
or Gurobi [39]). More specifically, a DNN verification query 
can be regarded as an LP instance with bounded variables 
that represents the property P and the affine transformations 
within M, combined with a set of piecewise-linear constraints 
that represent the activation functions. We demonstrate this 
with an example, and then explain how this formulation can 
be solved. 

Recall the toy DNN from Fig. 1, and property $P$ that is
used for checking whether there exists an input $x$ in the range
$[2,3] \times [-1,1]$ for which $N$ produces an output $y$ in the range
$[0.25, 0.5]$. We use $b_1, f_1$ to denote the input and output to
node $v_1$; $b_2, f_2$ for the input and output of $v_2$; $x_1$ and $x_2$ to
denote the network's inputs, and $y$ to denote the network's
output. The linear constraints of the network yield the linear
equations $b_1 = x_1 - x_2$, $b_2 = -2 f_1$, and $y = f_2$ (which
we name $e^1$, $e^2$, and $e^3$, respectively). The restrictions on the
network's input and output are translated to lower and upper
bounds: $2 \le x_1 \le 3$, $-1 \le x_2 \le 1$, $0.25 \le y \le 0.5$. The third
equation implies that $0.25 \le f_2 \le 0.5$, which in turn implies
that $b_2 \le 0.5$. Assume we also restrict: $-0.5 \le b_2$,
$-0.5 \le b_1 \le 0.5$, $0 \le f_1 \le 0.5$. Together, these constraints give rise
to the linear program that appears in Fig. 2. The remaining
ReLU constraints, i.e., $f_i = \mathrm{ReLU}(b_i)$ for $i \in \{1,2\}$, exist
alongside the LP instance. Together, the query $\varphi$ is equivalent to
the DNN verification problem that we are trying to solve.


Fig. 2: An example of a DNN verification query $\varphi$, comprised
of an LP instance and piecewise-linear constraints.

Using this formulation, the verification problem can be
solved using Simplex, enhanced with a case-splitting approach
for handling the ReLU constraints [17], [48]. Intuitively, we 
first invoke the LP solver on the LP portion of the query; and 
if it returns UNSAT, the whole query is UNSAT. Otherwise, 
if it finds a satisfying assignment, we check whether this 
assignment also satisfies the ReLU constraints. If it does, 
then the whole query is SAT. Otherwise, case splitting is 
applied in order to split the query into two different sub-queries,
according to the two phases of the ReLU function.¹
Specifically, in one of the sub-queries, the LP query is adjusted
to enforce the ReLU to be in the active phase: the equation
$f = b$ is added, along with the bound $b \ge 0$. In the other sub-query,
the inactive phase is enforced: $b \le 0$, $0 \le f \le 0$. This
effectively reduces the ReLU constraint into linear constraints 
in each sub-query. This process is then repeated for each of 
the two sub-queries. 
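
A case split can be sketched as a function that maps one query to its two sub-queries. The dictionary layout below (`equations`, `lower`, `upper`, `relus`) is a hypothetical representation chosen for illustration and is not the data structure of any particular verifier; the bounds are those of the toy query from Fig. 2.

```python
import copy


def split_relu(query, b, f):
    """Return the (active, inactive) sub-queries for the constraint f = ReLU(b)."""
    active = copy.deepcopy(query)
    active["equations"].append({f: 1.0, b: -1.0})          # new equation f - b = 0
    active["lower"][b] = max(active["lower"][b], 0.0)       # b >= 0
    active["relus"].remove((b, f))                          # constraint is now linear

    inactive = copy.deepcopy(query)
    inactive["upper"][b] = min(inactive["upper"][b], 0.0)   # b <= 0
    inactive["lower"][f] = max(inactive["lower"][f], 0.0)   # 0 <= f
    inactive["upper"][f] = min(inactive["upper"][f], 0.0)   # f <= 0
    inactive["relus"].remove((b, f))
    return active, inactive


# Toy query: each equation maps variables to coefficients and sums to zero.
phi = {
    "equations": [
        {"x1": 1.0, "x2": -1.0, "b1": -1.0},  # b1 = x1 - x2
        {"b2": -1.0, "f1": -2.0},             # b2 = -2 * f1
        {"f2": 1.0, "y": -1.0},               # y = f2
    ],
    "lower": {"x1": 2.0, "x2": -1.0, "b1": -0.5, "b2": -0.5,
              "f1": 0.0, "f2": 0.25, "y": 0.25},
    "upper": {"x1": 3.0, "x2": 1.0, "b1": 0.5, "b2": 0.5,
              "f1": 0.5, "f2": 0.5, "y": 0.5},
    "relus": [("b1", "f1"), ("b2", "f2")],
}

active, inactive = split_relu(phi, "b2", "f2")
print(len(active["equations"]), active["lower"]["b2"])
print(inactive["lower"]["f2"], inactive["upper"]["f2"])
```

In the inactive sub-query the lower bound of `f2` (0.25) already exceeds its upper bound (0), so that branch is unsatisfiable without running the LP solver at all.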

Case-splitting turns the verification procedure into a search 
tree [48], with nodes corresponding to the splits that were ap- 
plied. The tree is constructed iteratively, with Simplex invoked 
on every node to try and derive UNSAT or find a true satisfying 
assignment. If Simplex is able to deduce that all leaves in 
the search tree are UNSAT, then so is the original query. 
Otherwise, it will eventually find a satisfying assignment that 
also satisfies the original query. This process is sound, and 
will always terminate if appropriate splitting strategies are 
used [22], [48]. Unfortunately, the size of the search tree 
can be exponential in the number of ReLU constraints; and 
so in order to keep the search tree small, case splitting is 
applied as little as possible, according to various heuristics that 
change from tool to tool [55], [62], [68]. In order to reduce 
the number of splits even further, verification algorithms apply 
clever deduction techniques for discovering tighter variable 
bounds, which may in turn rule out some of the splits a-priori. 
We also discuss this kind of deduction, which we refer to as 
dynamic bound tightening, in the following sections. 


¹ The approach is easily generalizable to any piecewise-linear constraint, by
splitting the query according to the different linear pieces of the activation
function.


III. PROOF PRODUCTION OVERVIEW
A Simplex-based verification process of a DNN is tree-shaped,
and so we propose to generate a proof tree to match
it. Within the proof tree, internal nodes will correspond to
case splits, whereas each leaf node will contain a proof of 
unsatisfiability based on all splits performed on the path 
between itself and the root. Thus, a proof tree constitutes a 
valid proof of unsatisfiability if each of its leaves contains 
a proof that demonstrates that all splits so far lead to a 
contradiction. The proof tree might also include proofs for 
lemmas, which are valid statements for the node in which they 
reside and its descendants (lemmas are needed for supporting 
bound tightening, as we discuss later). 

As a simple, intuitive example, we depict in Fig. 3 a proof
of unsatisfiability for the query $\varphi$ from Fig. 2. The root of
the proof tree represents the initial verification query, which
is comprised of LP constraints and ReLU constraints. The
fact that this node is not a leaf indicates that the Simplex-based
verifier was unable to conclude UNSAT in this state,
and needed to perform a case split on the ReLU node $v_1$. The
left child of the root corresponds to the case where ReLU $v_1$ is
inactive: the LP is augmented with additional constraints that
represent the case split, i.e., $f_1 = 0$ and $b_1 \le 0$. This new fact
may now be used by the Simplex procedure, which is indeed
able to obtain an UNSAT result. The node then contains a proof
of this unsatisfiability: $[-1\ 0\ 0]^T$. This vector instructs the
checker how to construct a linear combination of the current
tableau's rows, in a way that leads to a bound contradiction,
as we later explain in Sec. V.


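$\varphi$
  $v_1$ inactive: $\varphi \wedge (f_1 = 0) \wedge (b_1 \le 0)$; proof: $[-1\ 0\ 0]^T$
  $v_1$ active: $\varphi \wedge (f_1 = b_1) \wedge (b_1 \ge 0)$
    $v_2$ inactive: $\varphi \wedge (f_1 = b_1) \wedge (b_1 \ge 0) \wedge (f_2 = 0) \wedge (b_2 \le 0)$; proof: $f_2$
    $v_2$ active: $\varphi \wedge (f_1 = b_1) \wedge (b_1 \ge 0) \wedge (f_2 = b_2) \wedge (b_2 \ge 0)$; proof: $[-2\ 1\ 0\ -2\ 0]^T$

Fig. 3: A proof tree example.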


In the right child of the root, which represents $v_1$'s active
phase, the constraints $f_1 = b_1$ and $b_1 \ge 0$ are added by the
split. This node is not a leaf, because the verifier performed a
second case split, this time on $v_2$. The left child represents
$v_2$'s inactive phase, and has the corresponding constraints
$f_2 = 0$ and $b_2 \le 0$. This child is a leaf, and is marked
with $f_2$, indicating that $f_2$ is a variable whose bounds led
to a contradiction. Specifically, $f_2 \ge 0.25$ from $\varphi$ and $f_2 = 0$
from the case split are contradictory.

The last node (the rightmost leaf) represents $v_2$'s active
phase, and has the constraints $f_2 = b_2$ and $b_2 \ge 0$. Here,
the node indicates that a contradiction can be reached from
the current tableau, using the vector $[-2\ 1\ 0\ -2\ 0]^T$.
In Sec. IV, we explain how this process works.

Because each leaf of the proof tree contains a proof of 
unsatisfiability, the tree itself proves that the original query 
is UNSAT. Note that many other proof trees may exist for 
the same query. In the following sections, we explain how to 
instrument a Simplex-based verifier in order to extract such 
proof trees from the solver execution. 


IV. SIMPLEX WITH PROOFS 
A. Producing proofs for LP 


We now describe our approach for creating proof trees, 
beginning with leaf nodes. We start with the following lemma: 


Lemma 1. If Simplex returns UNSAT, then there exists a
variable with contradicting bounds; that is, there exists a
variable $x_i \in V$ with lower and upper bounds $l(x_i)$ and $u(x_i)$,
for which Simplex has discovered that $l(x_i) > u(x_i)$.


This lemma justifies our choice of using contradicting 
bounds as proofs of unsatisfiability in the leaves of the proof 
tree. The lemma follows directly from the derivation rules 
of Simplex. Specifically, there are only two ways to reach 
UNSAT: when the input problem already contains inconsistent 
bounds $l(x_i) > u(x_i)$, or when Simplex finds a tableau row
$x_i = \sum_{x_j \notin B} c_j \cdot x_j$ that gives rise to such inconsistent bounds.
The complete proof appears in the extended version of this
paper [42].

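We demonstrate this with an example, based on the query $\varphi$
from Fig. 2. Suppose that, as part of its Simplex-based solution
process, a DNN verifier performs two case splits, fixing the
two ReLUs to their active states: $f_1 = b_1 \wedge b_1 \ge 0$ and
$f_2 = b_2 \wedge b_2 \ge 0$. This gives rise to the following (slightly simplified)
system of equations:

$$b_1 = x_1 - x_2, \quad b_2 = -2 f_1, \quad y = f_2, \quad f_1 = b_1, \quad f_2 = b_2$$

which corresponds to the tableau and variables

$$A = \begin{bmatrix}
1 & -1 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -2 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & -1 \\
0 & 0 & 1 & 0 & -1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & -1 & 0
\end{bmatrix}, \qquad
V = \begin{bmatrix} x_1 & x_2 & b_1 & b_2 & f_1 & f_2 & y \end{bmatrix}^T$$

such that $A \cdot V = 0$, with the corresponding bound vectors:

$$l = [2\ \ {-1}\ \ 0\ \ 0\ \ 0\ \ 0.25\ \ 0.25]^T$$
$$u = [3\ \ 1\ \ 0.5\ \ 0.5\ \ 0.5\ \ 0.5\ \ 0.5]^T$$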


Then, the Simplex solver iteratively alters the set of basic
variables, which corresponds to multiplying various equations
by scalars and summing them to obtain new equations. At
some point, the equation $b_2 = -2x_1 + 2x_2$ is obtained (by
computing $[-2\ 1\ 0\ -2\ 0] \cdot A \cdot V$), with a current
assignment of $\alpha(V)^T = [2\ 1\ 1\ -2\ 1\ -2\ -2]$.

At this point, the Simplex solver halts with an UNSAT
notice. The reason is that $b_2$ is currently assigned the value
$-2$, which is below its lower bound of 0, and so its value
needs to be increased. However, the equation, combined with
the fact that $x_1$ is pressed against its lower bound, while $x_2$ is
pressed against its upper bound, indicates that there is no slack
remaining in order to increase the value of $b_2$ (this corresponds
to the Failure; rule in the Simplex calculus described in the 
extended version of this paper [42]). The key point is that the 
same equation could be used in deducing a tighter bound for
$b_2$:

$$b_2 = -2x_1 + 2x_2 \le -2\,l(x_1) + 2\,u(x_2) = -4 + 2 = -2,$$

and a contradiction could then be obtained based on the
contradictory facts $0 = l(b_2) \le b_2 \le -2$. In other words, and
as we formally prove in the extended version of this paper [42],
any UNSAT answer returned by Simplex can be regarded as a
case of conflicting lower and upper bounds.

Given Lemma 1, our goal is to instrument the Simplex 
procedure so that whenever it returns UNSAT, we are able to 
produce a proof which indicates that $l(x_i) > u(x_i)$ for some
variable $x_i$. To this end, we introduce the following adaptation
of Farkas’ Lemma [67] to the Simplex setting, which states 
that a linear-sized proof of this fact exists. 


Lemma 2. Given the constraints $A \cdot V = 0$ and $l \le V \le u$,
where $A \in M_{m \times n}(\mathbb{R})$ and $l, V, u \in \mathbb{R}^n$, exactly one of these
two options holds:
1) The SAT case: $\exists V \in \mathbb{R}^n$ such that $A \cdot V = 0$ and $l \le V \le u$.
2) The UNSAT case: $\exists w \in \mathbb{R}^m$ such that for all $l \le V \le u$,
$w^T \cdot A \cdot V < 0$, whereas $0 \cdot w = 0$. Thus, $w$ is a proof of
the constraints' unsatisfiability.


Moreover, these vectors can be constructed during the run of 
the Simplex algorithm. 


This Lemma is actually a corollary of Theorem 3, which we 
introduce later. For a complete proof, see the extended version 
of this paper [42]. 

In our previous UNSAT example, one possible vector is
$w = [-2\ 1\ 0\ -2\ 0]^T$. Indeed, $w^T \cdot A \cdot V = 0$ gives us
the equation $-2x_1 + 2x_2 - b_2 = 0$. Given the lower and upper
bounds for the participating variables, the largest value that
the left-hand side of the equation can obtain is:

$$-2\,l(x_1) + 2\,u(x_2) - l(b_2) = -4 + 2 - 0 = -2.$$

Therefore, no variable assignment within the stated bounds can
satisfy the equation, indicating that the constraints are UNSAT.
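
The check performed on such a vector is mechanical, as the following sketch shows for the tableau $A$ and bounds $l, u$ of the running example. The function names are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used here only as an assumption about how one might avoid rounding issues in a checker.

```python
from fractions import Fraction as F

# Tableau rows over V = [x1, x2, b1, b2, f1, f2, y]; each row r means r . V = 0.
A = [
    [1, -1, -1,  0,  0,  0,  0],  # b1 = x1 - x2
    [0,  0,  0, -1, -2,  0,  0],  # b2 = -2 * f1
    [0,  0,  0,  0,  0,  1, -1],  # y  = f2
    [0,  0,  1,  0, -1,  0,  0],  # f1 = b1
    [0,  0,  0,  1,  0, -1,  0],  # f2 = b2
]
lower = [F(2), F(-1), F(0), F(0), F(0), F(1, 4), F(1, 4)]
upper = [F(3), F(1), F(1, 2), F(1, 2), F(1, 2), F(1, 2), F(1, 2)]


def combine_rows(w, A):
    """Return the row vector w^T . A."""
    return [sum(wi * row[j] for wi, row in zip(w, A)) for j in range(len(A[0]))]


def max_over_box(row, lower, upper):
    """Largest value of row . V subject to lower <= V <= upper."""
    return sum(c * (u if c > 0 else l) for c, l, u in zip(row, lower, upper))


def certifies_unsat(w, A, lower, upper):
    """w proves UNSAT if (w^T . A) . V is negative everywhere in the box."""
    return max_over_box(combine_rows(w, A), lower, upper) < 0


w = [-2, 1, 0, -2, 0]
print(combine_rows(w, A))                              # coefficients of -2*x1 + 2*x2 - b2
print(max_over_box(combine_rows(w, A), lower, upper))  # largest value of the left-hand side
print(certifies_unsat(w, A, lower, upper))
```

Only additions and multiplications are needed, and the zero vector is correctly rejected because its combined row has maximum 0 rather than a negative value.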

Given Lemma 2, all that remains is to instrument the
Simplex solver in order to produce the proof vector $w$ on
the fly, whenever a contradiction is detected. In case a trivial
contradiction $l(x_i) > u(x_i)$ is given as part of the input
query for some variable $x_i$, we simply return “$x_i$” as the
proof (we later discuss also how to handle this case in the
presence of dynamic bound tightenings). Otherwise, a non-trivial
contradiction is detected as a result of an equation
$e \equiv x_i = \sum_{x_j \notin B} c_j \cdot x_j$, which contradicts one of the input
bounds of $x_i$. In this case, no assignment can satisfy the
equivalent equation $\sum_{x_j \notin B} c_j \cdot x_j - x_i = 0$. Since the Simplex
algorithm applies only linear operations to the input tableau,
$e$ is given by a linear combination of the original tableau rows.
Let $coef(e)$ denote the Farkas vector of the equation $e$, i.e., the
column vector such that $coef(e)^T \cdot A = e$, and which proves
unsatisfiability in this case. Our framework simply keeps track,
for each row of the tableau, of its coefficient vector; and if that
row leads to a contradiction, the vector is returned.


B. Supporting dynamic bound tightening 


So far, we have only considered Simplex executions that do 
not perform any bound tightening steps; i.e., derive UNSAT 
by finding a contradiction to the original bounds. However, in 
practice, modern DNN solvers perform a great deal of dynamic 
bound tightening, and so this needs to be reflected in the proof. 

We use the term ground bounds to refer to variable bounds 
that are part of the LP being solved, whether they were 
introduced by the original input, or by successive case splits, 
as we will explain in Sec. V. This is opposed to dynamic 
bounds, which are bounds introduced on the fly, via bound 
tightening. The ground bounds, denoted $l, u \in \mathbb{R}^n$, are used
in explaining dynamic bounds, denoted $l', u' \in \mathbb{R}^n$, via Farkas
vectors.

For simplicity, we consider here a simple and popular
version of bound tightening, called interval propagation [25],
[48]. Given an equation $x_i = \sum_{x_j \notin B} c_j \cdot x_j$ and current bounds
$l'(x)$ and $u'(x)$ for each of the variables (whether these are the
ground bounds or dynamically tightened bounds themselves),
a new upper bound for $x_i$ can be derived:

$$u'(x_i) := \sum_{x_j \notin B,\ c_j > 0} c_j \cdot u'(x_j) + \sum_{x_j \notin B,\ c_j < 0} c_j \cdot l'(x_j) \qquad (1)$$

(provided that the new bound is tighter, i.e., smaller, than the
current upper bound for $x_i$). A symmetrical version exists for
discovering lower bounds.
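
Eq. 1 and its lower-bound counterpart translate directly into code. The sketch below uses hypothetical function names and applies the rule to the equation $b_1 = x_1 - x_2$ with the bounds of the toy query; a candidate bound is kept only if it is tighter than the current one.

```python
def candidate_upper(coeffs, lower, upper):
    """Upper bound on x_i implied by x_i = sum_j coeffs[x_j] * x_j (Eq. 1)."""
    return sum(c * (upper[x] if c > 0 else lower[x]) for x, c in coeffs.items())


def candidate_lower(coeffs, lower, upper):
    """Symmetric rule: lower bound on x_i implied by the same equation."""
    return sum(c * (lower[x] if c > 0 else upper[x]) for x, c in coeffs.items())


lower = {"x1": 2.0, "x2": -1.0, "b1": -0.5}
upper = {"x1": 3.0, "x2": 1.0, "b1": 0.5}
e1 = {"x1": 1.0, "x2": -1.0}  # b1 = x1 - x2

# Keep a candidate only when it improves on the current bound.
new_upper = min(upper["b1"], candidate_upper(e1, lower, upper))
new_lower = max(lower["b1"], candidate_lower(e1, lower, upper))
print(candidate_upper(e1, lower, upper), candidate_lower(e1, lower, upper))
print(new_lower, new_upper)
```

Here the candidate upper bound (4) is looser than the existing one and is discarded, whereas the candidate lower bound (1) is tighter and replaces the old value.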

A naive approach for handling bound tightening is to store, 
each time a new bound is discovered, a separate proof that 
justifies it; for example, a Farkas vector for deriving the 
equation that was used in the bound tightening. However, 
a Simplex execution can include many thousands of bound 
tightenings — and so doing this would strain resources. Even 
worse, many of the intermediate bound tightenings might not 
even participate in deriving the final contradiction, and so 
storing them would be a waste. 

In order to circumvent this issue, we propose a scheme in 
which we store, for each variable in the query, a single column 
vector that justifies its current lower bound, and another for its 
current upper bound. Whenever a tighter bound is dynamically 
discovered, the corresponding vector is updated; and even if 
other, previously discovered dynamic bounds were used in the 
derivation, the vector that we store indicates how the same 
bound can be derived using the ground bounds. Thus, the proof 
of the tightened bounds remains compact, regardless of the 
number of derived bounds; specifically, it requires only $O(n \cdot m)$
space overall. Formally, we have the following result:


Theorem 3. Let $A \cdot V = 0$ such that $l \le V \le u$ be an LP
instance, where $A \in M_{m \times n}(\mathbb{R})$ and $l, V, u \in \mathbb{R}^n$.
Let $u', l' \in \mathbb{R}^n$ represent dynamically tightened bounds of
$V$. Then $\forall i \in [n]$ $\exists f_u(x_i), f_l(x_i) \in \mathbb{R}^m$ such that $f_u(x_i)^T \cdot A$
and $f_l(x_i)^T \cdot A$ can be used to efficiently compute $u'(x_i)$, $l'(x_i)$
from $l$ and $u$. Moreover, vectors $f_u(x_i)$ and $f_l(x_i)$ can be
constructed during the run of the Simplex algorithm.


When a Simplex procedure with bound tightening reaches
an UNSAT answer, it has discovered a variable $x_i$ with
$l'(x_i) > u'(x_i)$. The theorem guarantees that in this case we
have two column vectors, $f_u(x_i)$ and $f_l(x_i)$, which explain
how $u'(x_i)$ and $l'(x_i)$ were discovered. We refer to these
vectors as the Farkas vectors of the upper and lower bounds of
$x_i$, respectively. Because $u'(x_i) - l'(x_i)$ is negative, the column
vector $w = f_u(x_i) - f_l(x_i)$ creates a tableau row which is
always negative, making $w \in \mathbb{R}^m$ a proof of unsatisfiability.
The formal, constructive proof of the theorem appears in the 
extended version of this paper [42]. 

In order to maintain $f_u(x_i)$ and $f_l(x_i)$ during the execution
of Simplex, whenever a tighter upper bound is derived using
Eq. 1, we update the matching Farkas vector:

$$f_u(x_i) := \sum_{j \ne i,\ c_j > 0} c_j \cdot f_u(x_j) + \sum_{j \ne i,\ c_j < 0} c_j \cdot f_l(x_j) + coef(e),$$

where $e$ is the linear equation used for tightening, and $coef(e)$
is the column vector such that $coef(e)^T \cdot A = e$. The lower
bound case is symmetrical. To demonstrate the procedure,
consider again the verification query from Fig. 2. Assume
the phases of $v_1, v_2$ have both been set to active, and that
consequently two new equations have been added: $e^4 : f_1 = b_1$,
$e^5 : f_2 = b_2$. In this example, we have five linear
equations, so we initialize a zero vector of size five for each of
the variable bounds. Now, suppose Simplex tightens the lower
bound of $b_1$ using the first equation $e^1$:

$$l'(b_1) := l(x_1) - u(x_2) = 2 - 1 = 1$$

and thus we update

$$f_l(b_1) = f_l(x_1) - f_u(x_2) + coef(e^1) = [0\ 0\ 0\ 0\ 0]^T - [0\ 0\ 0\ 0\ 0]^T + [1\ 0\ 0\ 0\ 0]^T = [1\ 0\ 0\ 0\ 0]^T$$

since all $f_l$ and $f_u$ vectors have been initialized to $0$ and
$coef(e^1) = [1\ 0\ 0\ 0\ 0]^T$, which indicates that $e^1$ is
simply the first row of the tableau.
We can now tighten bounds again, using the fourth row
$f_1 = b_1$, and get $l'(f_1) := l'(b_1) = 1$. We update $f_l(f_1)$:

$$f_l(f_1) = f_l(b_1) + coef(e^4) = [1\ 0\ 0\ 0\ 0]^T + [0\ 0\ 0\ 1\ 0]^T = [1\ 0\ 0\ 1\ 0]^T$$

To see that the Farkas vector can indeed explain the dynamically
tightened bound, observe that the combination
$[1\ 0\ 0\ 1\ 0]^T$ of tableau rows gives the equation
$f_1 = x_1 - x_2$. We can then tighten the lower bound of $f_1$, using the
ground bounds: $l'(f_1) := l(x_1) - u(x_2) = 2 - 1 = 1$. This
bound matches the one that we had discovered dynamically,
though we derived it using ground bounds only.
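
The bookkeeping in this example can be replayed in a short script. This is a sketch under stated assumptions: the names `explain` and `bound_from_ground` are hypothetical, and the way a bound is read back from a Farkas vector (adding the unit vector of the variable to the combined row, which is valid because $A \cdot V = 0$) is one possible convention chosen so that the zero vector yields the ground bound itself.

```python
VARS = ["x1", "x2", "b1", "b2", "f1", "f2", "y"]
A = [
    [1, -1, -1,  0,  0,  0,  0],  # e1: b1 = x1 - x2
    [0,  0,  0, -1, -2,  0,  0],  # e2: b2 = -2 * f1
    [0,  0,  0,  0,  0,  1, -1],  # e3: y  = f2
    [0,  0,  1,  0, -1,  0,  0],  # e4: f1 = b1
    [0,  0,  0,  1,  0, -1,  0],  # e5: f2 = b2
]
LOWER = [2, -1, 0, 0, 0, 0.25, 0.25]   # ground bounds
UPPER = [3, 1, 0.5, 0.5, 0.5, 0.5, 0.5]
M = len(A)

# One Farkas vector per variable and per bound, all initialized to zero.
f_low = {v: [0] * M for v in VARS}
f_up = {v: [0] * M for v in VARS}


def unit(k):
    """coef(e) for an equation that is simply row k of the tableau."""
    return [1 if i == k else 0 for i in range(M)]


def explain(coeffs, coef_e, kind):
    """Farkas vector for a bound on x_i derived from x_i = sum_j coeffs[x_j] * x_j."""
    vec = list(coef_e)
    for x, c in coeffs.items():
        # Upper bounds use f_up for positive and f_low for negative coefficients;
        # lower bounds are symmetric.
        src = f_up[x] if (c > 0) == (kind == "upper") else f_low[x]
        vec = [a + c * s for a, s in zip(vec, src)]
    return vec


def bound_from_ground(var, vec, kind):
    """Recompute the explained bound from the ground bounds only."""
    row = [sum(w * A[k][j] for k, w in enumerate(vec)) for j in range(len(VARS))]
    row[VARS.index(var)] += 1  # x_var = (vec^T A + e_var) . V, because A . V = 0
    if kind == "upper":
        return sum(c * (UPPER[j] if c > 0 else LOWER[j]) for j, c in enumerate(row))
    return sum(c * (LOWER[j] if c > 0 else UPPER[j]) for j, c in enumerate(row))


f_low["b1"] = explain({"x1": 1, "x2": -1}, unit(0), "lower")  # via e1
f_low["f1"] = explain({"b1": 1}, unit(3), "lower")            # via e4
f_up["b2"] = explain({"f1": -2}, unit(1), "upper")            # via e2

print(f_low["b1"], f_low["f1"], f_up["b2"])
print(bound_from_ground("f1", f_low["f1"], "lower"))
print(bound_from_ground("b2", f_up["b2"], "upper"))
```

Each variable keeps a single vector per bound regardless of how many intermediate tightenings were performed, which is the source of the $O(n \cdot m)$ space bound.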


V. DNN VERIFICATION WITH PROOFS 
A. Producing a proof-tree 


We now discuss how to leverage the results of Sec. IV 
in order to produce the entire proof tree for an UNSAT 
DNN verification query. Recall that the main challenge lies in 
accounting for the piecewise-linear constraints, which affect 
the solving process by introducing case-splits. 

Each case split performed by the solver introduces a branch- 
ing in the proof tree — with a new child node for each of the 
linear phases of the constraint being split on — and introduces 
new equations and bounds. In the case of ReLU, one child
node represents the active branch, through the equation $f = b$
and bound $b \ge 0$; and another represents the inactive branch,
with $b \le 0$ and $0 \le f \le 0$. These new bounds become
the ground bounds for this node: their Farkas vectors are 
reset to zero, and all subsequent Farkas vectors refer to these 
new bounds (as opposed to the ground bounds of the parent 
node). A new node inherits any previously-discovered dynamic 
bounds, as well as the Farkas vectors that explain them, from 
its parent; these vectors remain valid, as ground bounds only 
become tighter as a result of splitting (see the extended version 
of this paper [42]). 

For example, let us return to the query from Fig. 2 and the
proof tree from Fig. 3. Initially, the solver decides to split on
$v_1$. This adds two new children to the proof tree. In the first
child, representing the inactive case, we update the ground
bounds $u(b_1) := 0$, $u(f_1) := 0$, and reset the corresponding
Farkas vectors $f_u(b_1)$ and $f_u(f_1)$ to $0$. Now, Simplex can
tighten the lower bound of $b_1$ using the first equation $e^1$:

$$l'(b_1) := l(x_1) - u(x_2) = 2 - 1 = 1$$

resulting in the updated $f_l(b_1) = [1\ 0\ 0]^T$, as shown in
Sec. IV, where we use vectors of size three since in this search
state we have three equations. Observe that this bound contradicts
the upper ground bound of $b_1$, represented by the zero vector.
We can then use the vector

$$f_u(b_1) - f_l(b_1) = 0 - [1\ 0\ 0]^T = [-1\ 0\ 0]^T$$

as a proof for contradiction. Indeed, the matrix $A'$, which is
obtained using the first three rows and columns of $A$ as defined
in Sec. IV, corresponds to the tableau before adding any new
equations. Observe that $[-1\ 0\ 0] \cdot A' \cdot V = 0$ gives the
equation $-x_1 + x_2 + b_1 = 0$. Given the current ground bounds,
the largest value of the left-hand side is:

$$-l(x_1) + u(x_2) + u(b_1) = -2 + 1 + 0 = -1$$

which is negative, meaning that no variable assignment within
these bounds can satisfy the equation. This indicates that the
proof node representing $v_1$'s inactive phase is UNSAT.

In the second child, representing $v_1$'s active case, we update
the ground bound $l(b_1) := 0$ and the Farkas vector $f_l(b_1) := 0$.
We also add the equation $e^4 : f_1 = b_1$. Next, the solver
performs another split on $v_2$, adding two new children to the
tree. In the first one (representing the inactive case) we update
the ground bounds $u(b_2) := 0$, $u(f_2) := 0$, and reset the
corresponding Farkas vectors $f_u(b_2)$ and $f_u(f_2)$ to $0$. In this
node, we have a contradiction already in the ground bounds,
since $u(f_2) := 0$ but $l(f_2) := 0.25$. The contradiction in this
case is comprised of a symbol for $f_2$.

We are left with proving UNSAT for the last child, representing
the case where both ReLU nodes $v_1, v_2$ are active.
For this node of the proof tree, we update the ground bound
$l(b_2) := 0$ and Farkas vector $f_l(b_2) := 0$, and add the equation
$e^5 : f_2 = b_2$. Recall that previously, we learned the tighter
bound $l'(f_1) = 1$. With the same procedure as described in
Sec. IV, we can update $f_l(f_1) = [1\ 0\ 0\ 1\ 0]^T$. Now, we
can use $e^2 : b_2 = -2 f_1$ to tighten $u'(b_2) := -2\,l'(f_1) = -2$,
and consequently update the Farkas vector:

$$f_u(b_2) = -2 \cdot f_l(f_1) + coef(e^2) = -2 \cdot [1\ 0\ 0\ 1\ 0]^T + [0\ 1\ 0\ 0\ 0]^T = [-2\ 1\ 0\ -2\ 0]^T$$

The bound $u'(b_2) = -2$, explained by $[-2\ 1\ 0\ -2\ 0]^T$,
contradicts the ground bound $l(b_2) = 0$ explained by the zero
vector. Therefore, we get the vector

$$[-2\ 1\ 0\ -2\ 0]^T - 0 = [-2\ 1\ 0\ -2\ 0]^T$$

as the proof of contradiction for this node.


B. Bound tightenings from piecewise-linear constraints 


Modern solvers often use sophisticated methods [25], [50], 
[62] to tighten variable bounds using the piecewise-linear 
constraints. For example, if $f = \mathrm{ReLU}(b)$, then in particular
$b \le f$, and so $u(b) \le u(f)$. Thus, if initially $u(b) = u(f) = 7$
and it is later discovered that $u'(f) = 5$, we can deduce that
also $u'(b) = 5$. We show here how such tightening can be
supported by our proof framework, focusing on some ReLU 
tightening rules as specified in the extended version of this 
paper [42]. Supporting additional rules should be similar. 

We distinguish between two kinds of ReLU bound tight- 
enings. The first are tightenings that can be explained via 
a Farkas vector; these are handled the same way as bounds 
discovered using interval propagation. The second, more com- 
plex tightenings are those that cannot be explained using an 
equation (and thus a Farkas vector). Instead, we treat these 
bound tightenings as lemmas, which are added to the proof 
node along with their respective proofs; and the bounds that 
they tighten are introduced as ground bounds, to be used in 
constructing future Farkas vectors. The proof for a lemma 
consists of Farkas vectors explaining any current bounds that 
were used in deducing it; as well as an indication of the 
tightening rule that was used. The list of allowed tightening 
rules must be agreed upon beforehand and provided to the 
checker; in the extended version of this paper [42], we present 
the tightening rules for ReLUs that we currently support. 




For example, if $f = \mathrm{ReLU}(b)$ and $u'(f) = 5$ causes a
bound tightening $u'(b) = 5$, then this new bound $u'(b) = 5$
is stored as a lemma. Its proof consists of the Farkas vector
$f_u(f)$ which explains why $u'(f) = 5$, and an indication of the
deduction rule that was used (in this case, $u'(b) \le u'(f)$).


VI. PROOF CHECKING AND NUMERICAL STABILITY 


Checking the validity of a proof tree is straightforward. 
First, the checker must read the initial query and confirm that 
it is consistent with the LP and piecewise-linear constraints 
stored at the root of the tree. Next, the checker begins a 
depth-first traversal of the proof tree. Whenever it reaches 
a new inner node, it must confirm that that node’s children 
correspond to the linear phases of a piecewise-linear constraint 
present in the query. Further, the checker must maintain a 
list of current equations and lower and upper bounds, and 
whenever a new node is visited — update these lists (i.e., add 
equations and tighten bounds as needed), to reflect the LP 
stored in that node. Additionally, the checker must confirm 
the validity of lemmas that appear in the node — specifically, 
to confirm that they adhere to one of the permitted derivation 
rules. Finally, when a leaf node is visited, the checker must
confirm that the Farkas vector stored therein does indeed lead
to a contradiction when applied to the current LP, by
ensuring that the linear combination of rows created by the
Farkas vector leads to a matrix row $\sum_j c_j \cdot x_j = 0$, such that
for any assignment of the variables within their current bounds,
the left-hand side will have a negative value.
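
The traversal described above can be sketched as a small recursive checker, shown here on the proof tree of Fig. 3. This is a simplified illustration, not the checker evaluated in this paper: the dictionary-based node layout and all function names are hypothetical, lemmas are omitted, and only ReLU splits are handled. The checker derives the split constraints itself, so a node only needs to name the ReLU it splits on.

```python
from fractions import Fraction as F

VARS = ["x1", "x2", "b1", "b2", "f1", "f2", "y"]
IDX = {v: i for i, v in enumerate(VARS)}


def row(**coeffs):
    """Tableau row (meaning row . V = 0) built from named coefficients."""
    r = [F(0)] * len(VARS)
    for v, c in coeffs.items():
        r[IDX[v]] = F(c)
    return r


def max_over_box(r, lo, up):
    """Largest value of r . V subject to lo <= V <= up."""
    return sum(c * (up[i] if c > 0 else lo[i]) for i, c in enumerate(r))


def check(node, rows, lo, up, relus):
    """Depth-first check that every leaf under `node` holds a valid UNSAT proof."""
    if "split" not in node:  # leaf node
        proof = node["proof"]
        if isinstance(proof, str):  # a variable whose ground bounds contradict
            return lo[IDX[proof]] > up[IDX[proof]]
        if len(proof) != len(rows):
            return False
        combo = [sum(F(w) * r[j] for w, r in zip(proof, rows))
                 for j in range(len(VARS))]
        return max_over_box(combo, lo, up) < 0
    if node["split"] not in relus:  # children must match a ReLU of the query
        return False
    b, f = (IDX[v] for v in node["split"])
    # Inactive phase: b <= 0 and 0 <= f <= 0.
    lo_in, up_in = list(lo), list(up)
    up_in[b] = min(up[b], F(0))
    lo_in[f] = max(lo[f], F(0))
    up_in[f] = min(up[f], F(0))
    # Active phase: b >= 0 and the new equation b - f = 0.
    lo_ac = list(lo)
    lo_ac[b] = max(lo[b], F(0))
    rows_ac = rows + [row(**{VARS[b]: 1, VARS[f]: -1})]
    return (check(node["inactive"], rows, lo_in, up_in, relus)
            and check(node["active"], rows_ac, lo_ac, list(up), relus))


rows0 = [row(x1=1, x2=-1, b1=-1), row(b2=-1, f1=-2), row(f2=1, y=-1)]
lo0 = [F(2), F(-1), F(-1, 2), F(-1, 2), F(0), F(1, 4), F(1, 4)]
up0 = [F(3), F(1), F(1, 2), F(1, 2), F(1, 2), F(1, 2), F(1, 2)]
relus = [("b1", "f1"), ("b2", "f2")]
tree = {"split": ("b1", "f1"),
        "inactive": {"proof": [-1, 0, 0]},
        "active": {"split": ("b2", "f2"),
                   "inactive": {"proof": "f2"},
                   "active": {"proof": [-2, 1, 0, -2, 0]}}}

print(check(tree, rows0, lo0, up0, relus))
```

Replacing any leaf proof with a vector that does not yield a negative row (for example, the zero vector) makes `check` return `False`, which corresponds to the checker rejecting the certificate.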

The process of checking a proof certificate is thus much 
simpler than verifying a DNN using modern approaches, 
as it consists primarily of traversing a tree and computing 
linear combinations of the tableau’s rows. Furthermore, the 
proof checking process does not require using division for its 
arithmetic computations, thus making the checking program 
more stable arithmetically [44]. Consequently, we propose 
to treat the checker as a trusted code-base, as is commonly 
done [15], [49]. 


Complexity and Proof Size. Proving that a DNN verifi- 
cation query is SAT (by providing a satisfying assignment) 
is significantly easier than discovering an UNSAT witness 
using our technique. Indeed, this is not surprising; recall that 
the DNN verification problem is NP-complete, and that yes- 
instances of NP problems have polynomial-size witnesses (i.e., 
polynomial-size proofs). Discovering a way to similarly pro- 
duce polynomial proofs for no-instances of DNN verification 
is equivalent to proving that NP = coNP, which is a major 
open problem [8] and might, of course, be impossible. 


Numerical Stability. Recall that enhancing DNN verifiers 
with proof production is needed in part because they might 
produce incorrect UNSAT results due to numerical instability. 
When this happens, the proof checking will fail when checking 
a proof leaf, and the user will receive a warning. There are,
however, cases where the query is UNSAT, but only the proof
produced by the verifier is flawed. To recover from these cases
and correct the proof, we propose to use an external SMT
solver to re-solve the query stored in the leaf in question.

SMT solvers typically use sound arithmetic (as opposed to 
DNN verifiers), and so their conclusions are generally more 
reliable. Further, if a proof-producing SMT solver is used, 
the proof that it produces could be plugged into the larger 
proof tree, instead of the incorrect proof previously discovered. 
Although using SMT solvers to directly verify DNNs has been 
shown to be highly ineffective [48], [59], in our evaluation 
we observed that leaves typically represented problems that 
were significantly simpler than the original query, and could 
be solved efficiently by the SMT solver. 


VII. IMPLEMENTATION AND EVALUATION 


Implementation. For evaluation purposes, we instrumented 
the Marabou DNN verifier [50], [69] with proof production 
capabilities. Marabou is a state-of-the-art DNN verifier, which 
uses a native Simplex solver, and combines it with other 
modern techniques — such as abstraction and abstract inter- 
pretation [26], [27], [57], [62], [68], [71], advanced splitting 
heuristics [70], DNN optimization [63], and support for varied 
activation functions [6]. Additionally, Marabou has been ap- 
plied to a variety of verification-based tasks, such as verifying 
recurrent networks [43] and DRL-based systems [3], [5], [28], 
[51], network repair [34], [60], network simplification [33], 
[52], and ensemble selection [4]. 

As part of our enhancements to Marabou’s Simplex core, 
we added a mechanism that stores, for each variable, the 
current Farkas vectors that explain its bounds. These vectors 
are updated with each Simplex iteration in which the tableau 
is altered. Additionally, we instrumented some of Marabou’s 
Simplex bound propagation mechanisms — specifically, those 
that perform interval-based bound tightening on individual 
rows [25], to record for each tighter bound the Farkas vector 
that justifies it. Thus, whenever the Simplex core declares 
UNSAT as a result of conflicting bounds, the proof infrastruc- 
ture is able to collect all relevant components for creating the 
certificate for that particular leaf in the proof tree. Due to time 
restrictions, we were not able to instrument all of Marabou’s 
many bound propagation components; this is ongoing work, 
and our experiments described below were run with yet- 
unsupported components turned off. The only exception is 
Marabou’s preprocessing component, which is not supported, 
but is run before proof production starts. 

In order to keep track of Marabou’s tree-like search, we 
instrumented Marabou’s SmtCore class, which is in charge of 
case splitting and backtracking [50]. Whenever a case-split 
was performed, the corresponding equations and bounds were 
added to the proof tree as ground truths; and whenever a 
previous split was popped, our data structures would backtrack 
as well, returning to the previous ground bounds. 
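
The bookkeeping described above can be pictured as a stack of bound vectors, one frame per active case split. The following is a minimal sketch under that assumption; it is not Marabou's SmtCore code, and the class and method names are hypothetical.

```python
class GroundBoundStack:
    """Stack of (lower, upper) bound vectors, one frame per active case split.

    Hypothetical illustration only; not taken from Marabou.
    """

    def __init__(self, lower, upper):
        # The bottom frame holds the ground bounds of the original query.
        self.frames = [(list(lower), list(upper))]

    def current(self):
        """Ground bounds that hold in the current branch of the search."""
        return self.frames[-1]

    def push_split(self, var, lower=None, upper=None):
        """Enter a case split that tightens the bounds of one variable."""
        lo, hi = (list(b) for b in self.current())
        if lower is not None:
            lo[var] = max(lo[var], lower)
        if upper is not None:
            hi[var] = min(hi[var], upper)
        self.frames.append((lo, hi))

    def pop_split(self):
        """Backtrack: discard the latest split and restore the previous bounds."""
        if len(self.frames) == 1:
            raise IndexError("no case split to pop")
        self.frames.pop()
```

Copying the bound vectors on every split keeps the sketch simple; a real verifier would more likely record only the changed bounds.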

In addition to the instrumentation of Marabou, we also 
wrote a simple proof checker that receives a query and a proof 
artifact — and then checks, based on this artifact, that the 
query is indeed UNSAT. That checker also interfaces with the
cvc5 SMT solver [14] for attempting recovery from numerical
instability errors.
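
To make the leaf check concrete, here is a minimal sketch of how a Farkas vector can be validated with exact rational arithmetic. It assumes, for illustration, that a leaf is given as equations A·x = 0 together with bounds l ≤ x ≤ u. The function name and the input layout are hypothetical and do not describe the actual checker.

```python
from fractions import Fraction


def check_farkas_leaf(rows, lower, upper, farkas):
    """Return True if `farkas` certifies that rows * x = 0 has no solution
    with lower <= x <= upper (hypothetical input layout).

    Any solution satisfies (farkas^T * rows) . x = 0. If the largest value of
    that linear combination over the box is strictly negative, no solution exists.
    """
    n = len(lower)
    # combo[j] = sum_i farkas[i] * rows[i][j], computed exactly.
    combo = [
        sum((Fraction(w) * Fraction(row[j]) for w, row in zip(farkas, rows)),
            Fraction(0))
        for j in range(n)
    ]
    # Maximize combo . x over the box: take the upper bound for positive
    # coefficients and the lower bound otherwise.
    best = sum(
        (c * Fraction(upper[j] if c > 0 else lower[j]) for j, c in enumerate(combo)),
        Fraction(0),
    )
    return best < 0
```

For example, with the single equation x1 - x2 = 0 and bounds x1 in [0, 1], x2 in [2, 3], the vector [1] is a valid certificate, whereas [-1] is not. A rejected vector only means that this certificate is wrong; it does not by itself show that the leaf is satisfiable.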


Evaluation. We used our proof-producing version of Marabou 
to solve queries on the ACAS-Xu family of benchmarks for 
airborne collision avoidance [45]. We argue that the safety- 
critical nature of this system makes it a prime candidate for 
proof production. Our set of benchmarks thus consisted
of 45 networks and 4 properties to test on each, producing a
total of 180 verification queries. Marabou returned an UNSAT 
result on 113 of these queries, and so we focus on them. In the 
future, we intend to evaluate our proof-production mechanism 
on other benchmarks as well. 

We set out to evaluate our proof production mechanism 
along 3 axes: (i) correctness: how often was the checker able 
to verify the proof artifact, and how often did Marabou (prob- 
ably due to numerical instability issues) produce incorrect 
proofs?; (ii) overhead: by how much did Marabou’s runtime 
increase due to the added overhead of proof production?; and 
(iii) checking time: how long did it take to check the produced 
proofs? Below we address each of these questions. 

Correctness. Over 1.46 million proof-tree leaves were cre- 
ated and checked as part of our experiments. Of these, 
proof checking failed for only 77 leaves, meaning that the 
Farkas vector written in the proof-tree leaf did not allow 
the proof checker to deduce a contradiction. Out of the 113 
queries checked, 97 had all their proof-tree leaves checked 
successfully. As for the rest, typically only a tiny number 
of leaves would fail per query, but we did identify a single 
query where a significant number of proofs failed to check 
(see Fig. 4). We speculate that this query had some intrinsic 
numerical issues encoded into it (e.g., equations with very 
small coefficients [20]).

Fig. 4: Number of queries per number of leaves with incorrect
proofs.


Next, when we encoded each of the 77 leaves as a query 
to the cvc5 SMT solver [14], it was able to show that all 
queries were indeed UNSAT, in under 20 seconds per query. 
From this we learn that although some of the proof certificates 
produced by Marabou were incorrect, the ultimate UNSAT 
result was correct. Further, it is interesting to note how quickly
each of the queries could be solved. This gives rise to an
interesting verification strategy: use modern DNN verifiers to
do the “heavy-lifting”, and then use more precise SMT solvers
specifically on small components of the query that proved
difficult to solve accurately.

Fig. 5: Proof production and checking time comparison: absolute (left) and relative (right).

Overhead and Checking Time. In Fig. 5, we compare the 
running time of vanilla Marabou, the overhead incurred by 
our proof-production extension to Marabou, and the checking 
time of the resulting proof certificates. We can see that the
proof production overhead is relatively small for all
queries (5.7% on average), while the certification
time is non-negligible, but still shorter than the time it takes to
solve the queries, by 66.5% on average.


VIII. RELATED WORK 


The importance of proof production in verifiers has been 
repeatedly recognized, for example by the SAT, SMT, and 
model-checking communities (e.g., [15], [21], [38]). Although 
the risks posed by numerical imprecision within DNN verifiers 
have been raised repeatedly [12], [44], [48], [47], we are 
unaware of any existing proof-producing DNN verifiers. 

Proof production for various Simplex variants has been 
studied previously [56]. In [24], Dutertre and de Moura study a 
Simplex variant similar to ours, but without explicit support for 
dynamic bound tightening. Techniques for producing Farkas 
vectors have also been studied [10], but again without support 
for dynamic bound tightening, which is crucial in DNN 
verification. Other uses of Farkas vectors, specifically in the 
context of interpolants, have also been explored [9], [18]. 

Other frameworks for proof production for machine learning 
have also been proposed [7], [35]; but these frameworks are 
interactive, unlike the automated mechanism we present here. 


IX. CONCLUSION AND FUTURE WORK 


We presented a novel framework for producing proofs of un- 
satisfiability for Simplex-based DNN verifiers. Our framework 
constructs a proof tree that contains lemma proofs in internal 
nodes and unsatisfiability proofs in each leaf. The certificates 
of unsatisfiability that we provide can increase the reliability of
DNN verification, particularly when floating-point arithmetic
(which is susceptible to numerical instability) is used. 

We plan to continue this work along two orthogonal paths: 
(i) extend our mechanism to support additional steps per- 
formed in modern verifiers, such as preprocessing and addi- 
tional abstract interpretation steps [53], [62]; and (ii) use our 
infrastructure to allow learning succinct conflict clauses. Dur- 
ing search, the Farkas vectors produced by our approach could 
be used to generate conflict clauses on-the-fly. Intuitively, 
conflict clauses guide the verification algorithm to avoid any 
future search for a satisfying assignment within subspaces of 
the search space already proven to be UNSAT. Such clauses 
are a key component in modern SAT and SMT solvers, and 
are the main component of CDCL algorithms [74] — and 
could significantly curtail the search space traversed by DNN 
verifiers and improve their scalability.


Abstract—The TBUDDY library enables the construction and
manipulation of reduced, ordered binary decision diagrams 
(BDDs). It extends the capabilities of the BUDDY BDD pack- 
age to support trusted BDDs, where the generated BDDs are 
accompanied by proofs of their logical properties. These proofs 
are expressed in a standard clausal framework, for which a 
variety of proof checkers are available. Building on TBUDDY 
via its application-program interface (API) enables developers to 
implement automated reasoning tools that generate correctness 
proofs for their outcomes. In some cases, BDDs serve as the 
core reasoning mechanism for the tool, while in other cases they 
provide a bridge from the core reasoner to proof generation. 
A Boolean satisfiability (SAT) solver based on TBUDDY achieves 
polynomial scaling when generating unsatisfiability proofs for a 
number of problems that yield exponentially-sized proofs with 
standard solvers. It performs particularly well for formulas 
containing parity constraints, where it can employ Gaussian 
elimination to systematically simplify the constraints. 


I. INTRODUCTION 


Proof generation has become a core requirement for 
Boolean satisfiability (SAT) solvers when they encounter an 
unsatisfiable problem. The SAT solver generates a detailed 
proof in a standard proof format. An independent proof 
checker can then affirm that the problem is indeed unsatis- 
fiable, ruling out any false negative results due to a bug in 
the SAT solver’s algorithms or implementation. Most modern 
solvers are based on conflict-driven clause-learning (CDCL) 
algorithms, and these can readily be extended to gener- 
ate proofs in the Deletion Resolution Asymmetric Tautology 
(DRAT) proof framework [1], [2]. Like resolution proofs [3], 
a DRAT proof is a clausal proof consisting of a sequence 
of clauses, each of which preserves the satisfiability of the 
preceding clauses. An unsatisfiability proof starts with the 
clauses of the input formula and ends with an empty clause, 
indicating logical falsehood. The fact that this clause can be 
derived from the original formula proves that the original 
formula cannot be satisfied. 

Although a number of SAT solvers based on Binary De- 
cision Diagrams (BDDs) have been implemented over the 
years [4]-[8], most of these predated the era when proof 
generation became a priority. In 2006, Biere, Jussila, and Sinz 
demonstrated that the underlying logic behind standard BDD 
algorithms can be encoded as steps in an extended resolution 
framework [9], [10]. Extended resolution [11], [12] augments 
standard resolution by allowing proofs to introduce extension 
variables, serving as abbreviations for Boolean formulas over 
the input and other extension variables. This can yield proofs 
that are exponentially more compact than standard resolution
proofs [13]. Biere, Jussila, and Sinz use this capability by
introducing an extension variable for each BDD node generated.
The logic for each recursive step of standard BDD operations, 
based on the Apply algorithm [14], can then be expressed with 
a short sequence of proof steps. TBUDDY builds on this work. 

The DRAT framework also supports extension variables. 
Our solver PGBDD [15], [16] (for “proof-generating BDD”) 
demonstrated that a BDD-based SAT solver can generate 
DRAT proofs of unsatisfiability by integrating proof gen- 
eration into the BDD package. Our second solver PGPBS 
(for “proof-generating pseudo-Boolean solver”) augments the 
SAT solver with a pseudo-Boolean constraint solver, enabling 
it to generate DRAT proofs of unsatisfiability for problems 
where the input formula, described in conjunctive normal form 
(CNF), encodes parity and cardinality constraints [17]. PGPBS 
relies on the constraint solver to detect that the formula is 
unsatisfiable. BDDs serve only as a mechanism to prove that 
1) each of the extracted constraints is implied by the input 
formula, and 2) each step of the solver preserves satisfiability. 
These two solvers achieved polynomial scaling while gener- 
ating unsatisfiability proofs for a number of challenging SAT 
problems. 

The prototype solvers PGBDD and PGPBS demonstrated 
that BDDs can provide a useful framework for proof- 
generating automated reasoning tools, but their performance, 
in terms of both speed and capacity, was limited by their 
Python implementations. In this work, we describe TBUDDY, 
a high performance library for constructing and manipulating 
trusted BDDs. TBUDDY builds on BUDDY, a BDD package 
written by Jørn Lind-Nielsen while he was a PhD student at the 
Technical University of Denmark in the late 1990s [18]. It has 
subsequently been used and modified by a number of others, 
although the current version (2.4) has been unchanged on 
Sourceforge since 2014. BUDDY is written in C but has a C++ 
interface that provides more convenient memory management. 
These features were carried over to the implementation of 
TBUDDY. 

Although there are a number of BDD packages available, we 
chose to implement our proof-generating library by extending 
BUDDY for several reasons: 


• Multiple studies have shown that BUDDY generally performs
  as well as other BDD packages [19]-[21].

• BUDDY references nodes as integer indices into an array,
  rather than as pointers to a node data structure. As a
  result, it can manage BDDs with up to two billion ($2^{31}$)
  nodes using four-byte references, rather than the eight-byte
  pointers required for modern, 64-bit machines.

• BUDDY does not use complement pointers [22], [23]
  to denote Boolean negation. Although these can reduce
  BDD sizes and enable constant-time complementation,
  they would greatly complicate adding proof generation.
  Complement pointers rely on a symmetry between True
  and False that is not present in clausal representations.

• The BUDDY code is clear and concise. The complete
  package, prior to our modifications, consists of around
  13,000 lines of code. By contrast, the core of the popular
  CUDD package [24] has over 72,000 lines of code. CUDD
  includes many features that are not relevant for this work
  but would require updating as the core data structures
  are changed.

• BUDDY supports dynamic variable ordering [25]. We do
  not use that feature directly, since it would be challenging
  to keep the proof information updated as variables are
  swapped in the BDD. However, it enables maintaining
  a distinction between the numbering of variables in the
  input file and the ordering of those variables within the
  BDD. We have found this capability vital for achieving
  good performance on some benchmarks.


This paper describes the design and implementation of 
TBUDDY, as well as TBSAT, a proof-generating SAT solver 
implemented using TBUDDY. It presents experimental results 
for several scalable benchmarks that are intractable for current 
CDCL solvers. A complete version of the code is available at 
https://github.com/rebryant/tbuddy-artifact. 


II. PROOF GENERATION WITH BDDs 


Our immediate goal is to support the operations of a BDD- 
based SAT solver, generating one or more solutions when the 
formula is satisfiable and an unsatisfiability proof when it is 
not. Future uses of a proof-generating BDD package include a 
variety of automated reasoning tasks that would benefit from 
the assurances provided by checkable proofs of correctness. 


A. Notation 


Formulas are defined over a set of Boolean variables $X =
\{x_1, x_2, \ldots, x_n\}$. The symbols $u$, $v$, and $w$ also denote Boolean
variables, possibly with subscripts. The notation $\overline{u}$ denotes the
complement of variable $u$. A literal $\ell$ is either a variable or
its complement. A clause $C$ consists of a set of literals, and
a formula $\phi$ consists of a set of clauses. We denote a clause
as a disjunction of literals, enclosed in square brackets, e.g.,
$[u \lor \overline{v} \lor w]$. A clause consisting of a single literal $\ell$, denoted
$[\ell]$, is a unit clause.

An assignment $\alpha$ is a mapping from the input variables $X$
to the set $\{0, 1\}$, where 0 represents false, and 1 represents
true. Assignment $\alpha$ is said to satisfy clause $C$ if there is some
literal $\ell \in C$ such that $\ell = x$ and $\alpha(x) = 1$, or $\ell = \overline{x}$ and
$\alpha(x) = 0$. Assignment $\alpha$ satisfies formula $\phi$ if it satisfies every
clause in $\phi$. A formula $\phi$ is said to be satisfiable if it has a
satisfying assignment and to be unsatisfiable if no satisfying
assignment exists. A formula containing the empty clause $[\,]$
cannot be satisfied.

TABLE I
DEFINING CLAUSES FOR EXTENSION VARIABLE $u$ REPRESENTING BDD
NODE $\mathbf{u}$

Notation  Formula                          Clausal representation, by type of child
                                           Nonterminal child                             Child is 1                  Child is 0
HD(u)     $x \to (u \to u_1)$              $[\overline{x} \lor \overline{u} \lor u_1]$   1                           $[\overline{x} \lor \overline{u}]$
LD(u)     $\overline{x} \to (u \to u_0)$   $[x \lor \overline{u} \lor u_0]$              1                           $[x \lor \overline{u}]$
HU(u)     $x \to (u_1 \to u)$              $[\overline{x} \lor \overline{u}_1 \lor u]$   $[\overline{x} \lor u]$     1
LU(u)     $\overline{x} \to (u_0 \to u)$   $[x \lor \overline{u}_0 \lor u]$              $[x \lor u]$                1
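
The satisfaction definitions in this subsection can be sketched in a few lines. The sketch below assumes literals are encoded as nonzero signed integers (variable i as i, its complement as -i), following the DIMACS convention used later for proof files. The function names are hypothetical, and the brute-force satisfiability test is for illustration only.

```python
from itertools import product


def satisfies_clause(alpha, clause):
    """alpha maps variable index -> 0/1; a clause is a list of signed integers."""
    return any(alpha[abs(lit)] == (1 if lit > 0 else 0) for lit in clause)


def satisfies_formula(alpha, formula):
    """A formula (list of clauses) is satisfied when every clause is satisfied."""
    return all(satisfies_clause(alpha, clause) for clause in formula)


def is_satisfiable(formula, num_vars):
    """Exhaustive search over all 2^n assignments (exponential; illustration only)."""
    for bits in product((0, 1), repeat=num_vars):
        alpha = {i + 1: b for i, b in enumerate(bits)}
        if satisfies_formula(alpha, formula):
            return True
    return False
```

Note that the empty clause is never satisfied, since `any` over an empty list is false, which matches the statement that a formula containing it cannot be satisfied.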

A clausal proof consists of a sequence of clauses
$C_1, C_2, \ldots, C_m, C_{m+1}, \ldots, C_t$, where the first $m$ clauses are
those of the input formula $\phi$, while the subsequent clauses
have the property that they preserve the satisfiability of the
preceding clauses. That is, for all $m \le i < t$, if the formula
consisting of clauses $\{C_1, \ldots, C_i\}$ is satisfiable, then so is the
formula $\{C_1, \ldots, C_i, C_{i+1}\}$. A proof of unsatisfiability has an
empty clause as its final clause. The fact that this clause can
be derived via a sequence of such steps from the input formula
proves that the formula is unsatisfiable.


B. BDD Extension Variables and Defining Clauses 


The BDD package maintains a directed acyclic graph consisting
of a set of nodes, where each node $\mathbf{u}$ is either terminal
or nonterminal. There are just two terminal nodes: $T_0$, representing
false, and $T_1$, representing true. Nonterminal node $\mathbf{u}$
has an associated variable $\mathrm{Var}(\mathbf{u}) \in X$ as well as child nodes
$\mathrm{Low}(\mathbf{u})$ and $\mathrm{High}(\mathbf{u})$. Each BDD node $\mathbf{u}$ represents a Boolean
function, denoted $[\![\mathbf{u}]\!]$. Terminal nodes represent constant functions:
$[\![T_0]\!] = 0$, and $[\![T_1]\!] = 1$. The function for nonterminal
node $\mathbf{u}$ is defined recursively using the ITE operator (short for
“if-then-else”), where $\mathrm{ITE}(u, v, w) = (u \land v) \lor (\lnot u \land w)$:

$$[\![\mathbf{u}]\!] = \mathrm{ITE}\bigl(\mathrm{Var}(\mathbf{u}),\ [\![\mathrm{High}(\mathbf{u})]\!],\ [\![\mathrm{Low}(\mathbf{u})]\!]\bigr) \qquad (1)$$
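
Equation 1 translates directly into a recursive evaluator. The following is a minimal sketch in which terminal nodes are modeled as the Python values True and False and nonterminal nodes as a small record; this representation and the names are assumptions chosen for illustration, not the layout used by any BDD package.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Nonterminal BDD node (hypothetical representation)."""
    var: int      # index of the variable tested at this node
    high: object  # child followed when the variable is 1 (Node, True, or False)
    low: object   # child followed when the variable is 0 (Node, True, or False)


def evaluate(u, alpha):
    """Value of the function represented by u under assignment alpha (var -> 0/1)."""
    if isinstance(u, bool):
        return u  # terminal nodes are constant functions
    # ITE(Var(u), High(u), Low(u)): follow exactly one child.
    child = u.high if alpha[u.var] else u.low
    return evaluate(child, alpha)
```

For example, `Node(1, Node(2, True, False), False)` represents the conjunction of variables 1 and 2.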

The DRAT proof system supports an extension rule, similar
to that of extended resolution [11], [12]. That is, the proof can
define and reference extension variables serving as abbreviations
for Boolean formulas over input variables and previous
extension variables. Extension variable $u$ encoding Boolean
formula $F$ is introduced by including a set of defining clauses
in the proof encoding the formula $u \leftrightarrow F$. This capability is
key to proof generation with BDDs, with an extension variable
defined for every nonterminal node in the BDD.

An assignment $\alpha$ over the input variables can be uniquely
extended to assign values to the extension variables. Extension
variable $u$ is assigned the value resulting from applying its
defining formula $F$ to the values assigned to the input and
previous extension variables. For assignment $\alpha$ and extension
variable $u$, we therefore have $\alpha(u) \in \{0, 1\}$.

As with the approach of Biere, Sinz, and Jussila [9], [10],
each nonterminal BDD node has an associated extension variable.
Nodes are denoted by boldface letters, possibly with subscripts,
e.g., $\mathbf{u}$, $\mathbf{v}$, and $\mathbf{v}_1$, while their corresponding extension
variables are denoted with a normal face, e.g., $u$, $v$, and $v_1$.
The extension variables associated with the nonterminal nodes
of the BDD provide the proof with a semantic definition of
how BDDs encode Boolean functions according to Equation 1.
More precisely, for nonterminal node $\mathbf{v}$, let $\mathrm{Ex}(\mathbf{v}) = v$ be
the extension variable associated with the node. For the two
terminal nodes, define $\mathrm{Ex}(T_0) = 0$ and $\mathrm{Ex}(T_1) = 1$. For
nonterminal node $\mathbf{u}$, let $x = \mathrm{Var}(\mathbf{u})$, $u_1 = \mathrm{Ex}(\mathrm{High}(\mathbf{u}))$, and
$u_0 = \mathrm{Ex}(\mathrm{Low}(\mathbf{u}))$. Then the defining clauses for $\mathbf{u}$ encode
the formula $u \leftrightarrow \mathrm{ITE}(x, u_1, u_0)$. These clauses are shown in
Table I. As can be seen, when both children are nonterminal,
there will be four clauses, each containing three literals. When
one or more children are terminal nodes, some of the formulas
for the defining clauses degenerate into tautologies (indicated
by table entry 1). These are not included among the defining
clauses. Others have just two literals. For BDD node $\mathbf{u}$, we
let $\mathrm{Def}(\mathbf{u})$ denote the set of defining clauses for all nodes in
the subgraph with root $\mathbf{u}$.
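
The construction of the defining clauses, including the degenerate cases for terminal children, can be sketched as follows. The sketch assumes signed-integer literals, with terminal children passed as the Python values True (for T1) and False (for T0); the function name and this calling convention are hypothetical.

```python
def defining_clauses(x, u, u1, u0):
    """Defining clauses encoding u <-> ITE(x, u1, u0).

    x and u are positive variable indices. u1 and u0 are either positive
    variable indices (nonterminal child) or True/False (terminal child).
    Returns a dict from clause name to clause; tautologies are omitted.
    """
    def clause(fixed, child, positive):
        # Append the child literal with the given polarity, simplifying
        # when the child is a terminal node.
        if isinstance(child, bool):
            if child == positive:
                return None   # the child literal is true: tautology
            return fixed      # the child literal is false: drop it
        return fixed + [child if positive else -child]

    table = {
        "HD": clause([-x, -u], u1, True),    # x -> (u -> u1)
        "LD": clause([x, -u], u0, True),     # not x -> (u -> u0)
        "HU": clause([-x, u], u1, False),    # x -> (u1 -> u)
        "LU": clause([x, u], u0, False),     # not x -> (u0 -> u)
    }
    return {name: c for name, c in table.items() if c is not None}
```

With both children terminal, as in `defining_clauses(1, 5, True, False)`, only two binary clauses remain, and together they state that variable 5 is equivalent to variable 1.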

Consider assignment $\alpha$ over the input variables extended
to assign values to the extension variables. We will say that
assignment $\alpha$ satisfies BDD root $\mathbf{u}$ with associated extension
variable $u$ if $\alpha(u) = 1$. This will occur precisely for those
assignments where $[\![\mathbf{u}]\!]$, the Boolean function associated with
$\mathbf{u}$, evaluates to 1.


C. RUP Proof Steps 


Each logical inference for the subset of the DRAT proof
system we use is based on an application of the reverse unit
propagation (RUP) rule [26], [27]. RUP provides an easily
checkable way to combine a linear sequence of resolution steps
with subsumption. Let $C = [\ell_1 \lor \ell_2 \lor \cdots \lor \ell_p]$ be a clause to
be proved and let clauses $D_1, D_2, \ldots, D_k$ be a sequence of
supporting antecedent clauses occurring earlier in the proof.
The RUP step proves that $\bigwedge_{1 \le i \le k} D_i \to C$ by showing that
the combination of the antecedents plus the negation of $C$ leads
to a contradiction. The negation of $C$ is the formula
$\overline{\ell}_1 \land \overline{\ell}_2 \land \cdots \land \overline{\ell}_p$, having a CNF representation consisting of unit clauses
$[\overline{\ell}_i]$ for $1 \le i \le p$. A RUP check processes the antecedent
clauses in sequence, inferring additional unit clauses. In
processing clause $D_i$, if all but one of the literals in the clause are
negations of accumulated unit clauses, then we can
add the remaining literal to the accumulated set. The final step, with
clause $D_k$, must cause a contradiction, i.e., all of its literals
are falsified by the accumulated unit clauses.
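
The RUP check described above is short enough to sketch in full. The version below assumes signed-integer literals and antecedents supplied in the order in which they are to be processed. The function name is hypothetical, and, as a small leniency relative to the description, it accepts a contradiction as soon as one arises rather than only at the last antecedent.

```python
def rup_check(target, antecedents):
    """Return True if `antecedents`, processed in order, justify `target` by RUP."""
    # Start from the negation of the target clause, as a set of unit literals.
    units = {-lit for lit in target}
    for clause in antecedents:
        # A literal is falsified when its negation is among the unit literals.
        remaining = [lit for lit in clause if -lit not in units]
        if not remaining:
            return True            # every literal falsified: contradiction
        if len(remaining) != 1:
            return False           # clause is not unit at this point
        units.add(remaining[0])    # unit propagation
    return False                   # no contradiction was reached
```

For instance, the unit clause [3] follows from antecedents [1], [2], and [-1, -2, 3] in that order, but the same antecedents listed with the ternary clause first are rejected, because the check never searches for a better ordering.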


D. The Trusted BDD API 


The TBUDDY package supports the generation of trusted
BDDs (TBDDs). These are ones that have been formally
certified to be implied by the input formula. More precisely,
for a trusted BDD with root node $\mathbf{u}$ and associated extension
variable $u$, any assignment $\alpha$ to the input variables that satisfies
the input formula must also assign 1 to $u$. This can be written
as $\phi, \mathrm{Def}(\mathbf{u}) \models u$. This property is proved by generating a
sequence of proof clauses leading to a proof of the validating
clause, consisting of unit clause $[u]$. We use the notation $\dot{\mathbf{u}}$ to
indicate that node $\mathbf{u}$ is trusted.

/* Generate TBDD from input clause */
tbdd tbdd_from_clause_id(int i);

/* Form conjunction of two TBDDs */
tbdd tbdd_and(tbdd u, tbdd v);

/* Upgrade BDD v to TBDD */
tbdd tbdd_validate(bdd v, tbdd u);

/* Generate proof of clause */
int tbdd_validate_clause(ilist lits, tbdd u);


Fig. 1. Trusted BDD API Function Prototypes


The TBUDDY API provides several procedures that enable 
the generation of TBDDs. Their prototypes are shown in 
Figure 1. In these, data types bdd and tbdd represent BDDs 
and TBDDs, respectively, as is described in Section III-A. Data 
type ilist is the API’s representation of integer lists. 

The tbdd_from_clause_id operation generates the
BDD representation $\mathbf{u}_i$ of input clause $C_i$, as well as a proof
of unit clause $[u_i]$. The BDD representation of a clause is a
linear chain. The proof that $C_i, \mathrm{Def}(\mathbf{u}_i) \models u_i$ consists of a
single RUP step, with $C_i$ plus a subset of the defining clauses
for the nodes in the chain as antecedents [10].

Given trusted BDDs $\dot{\mathbf{u}}$ and $\dot{\mathbf{v}}$, the tbdd_and operation
first generates the BDD representation $\mathbf{w}$ of their conjunction.
It also generates a proof that $u \land v \to w$, given by the clause
$[\overline{u} \lor \overline{v} \lor w]$. It then uses a RUP step with this clause plus
unit clauses $[u]$ and $[v]$ to prove the unit clause $[w]$, upgrading
node $\mathbf{w}$ to $\dot{\mathbf{w}}$. As is described below, the BDD construction
and the proof generation are performed by a version of the
BDD APPLYAND operation that generates both a BDD node
and a sequence of proof steps [15], [16].

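The standard version of the APPLYAND procedure recursively
traverses the nodes for the two arguments and generates
intermediate result nodes [14]. It maintains an operation table
of previously computed results to ensure polynomial complexity.
Given arguments $\mathbf{u}$ and $\mathbf{v}$, it directly handles the cases
where one argument is a terminal node. Failing this, it looks
in the table with key $(\mathbf{u}, \mathbf{v}, \mathrm{And})$ and returns any stored result.
Otherwise, a set of recursive calls is required. The program
chooses variable $x$ as the least (in the BDD variable ordering)
among variables $\mathrm{Var}(\mathbf{u})$ and $\mathrm{Var}(\mathbf{v})$ and splits into two cases,
given by nodes $\mathbf{u}_1$ and $\mathbf{v}_1$, and nodes $\mathbf{u}_0$ and $\mathbf{v}_0$. Here $\mathbf{u}_1$ and
$\mathbf{u}_0$ are the high and low children of $\mathbf{u}$ when $\mathrm{Var}(\mathbf{u}) = x$, and
$\mathbf{u}$ itself otherwise, and likewise for $\mathbf{v}$. It recursively
computes nodes $\mathbf{w}_1$ and $\mathbf{w}_0$ as the conjunctions of $\mathbf{u}_1$ and $\mathbf{v}_1$,
and of $\mathbf{u}_0$ and $\mathbf{v}_0$, respectively. When $\mathbf{w}_1 = \mathbf{w}_0$, this becomes
the returned result $\mathbf{w}$. Otherwise node $\mathbf{w}$ is created having
$\mathrm{Var}(\mathbf{w}) = x$, $\mathrm{High}(\mathbf{w}) = \mathbf{w}_1$, and $\mathrm{Low}(\mathbf{w}) = \mathbf{w}_0$. Before
returning, an entry with key $(\mathbf{u}, \mathbf{v}, \mathrm{And})$ and result $\mathbf{w}$ is added
to the table.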
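
The standard (non-proof-generating) APPLYAND recursion can be sketched as below. Nodes are integer indices into a list, with indices 0 and 1 reserved for the terminal nodes, loosely mirroring the index-based node references mentioned in the introduction. The class, its method names, and the use of Python dictionaries for the unique and operation tables are assumptions made for illustration and do not reflect BUDDY's actual code.

```python
class BddManager:
    """Toy BDD manager; variables are identified by their level in the ordering."""

    def __init__(self):
        # nodes[i] = (level, low, high); entries 0 and 1 are the terminals T0, T1.
        self.nodes = [(None, None, None), (None, None, None)]
        self.unique = {}    # (level, low, high) -> node index
        self.op_table = {}  # (u, v) -> index of the node for u AND v

    def make_node(self, level, low, high):
        if low == high:
            return low      # both branches agree: no node is needed
        key = (level, low, high)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def var(self, level):
        return self.make_node(level, 0, 1)

    def cofactors(self, u, level):
        """(low, high) children of u for `level`, or (u, u) if u does not test it."""
        if u > 1 and self.nodes[u][0] == level:
            return self.nodes[u][1], self.nodes[u][2]
        return u, u

    def apply_and(self, u, v):
        # Cases where an argument is a terminal node.
        if u == 0 or v == 0:
            return 0
        if u == 1:
            return v
        if v == 1:
            return u
        if (u, v) in self.op_table:
            return self.op_table[(u, v)]
        level = min(self.nodes[u][0], self.nodes[v][0])
        u0, u1 = self.cofactors(u, level)
        v0, v1 = self.cofactors(v, level)
        w1 = self.apply_and(u1, v1)
        w0 = self.apply_and(u0, v0)
        w = self.make_node(level, w0, w1)
        self.op_table[(u, v)] = w
        return w
```

A short usage example: after `m = BddManager()`, `a = m.var(1)`, and `b = m.var(2)`, the call `m.apply_and(a, b)` creates a node at level 1 whose low child is terminal 0 and whose high child is `b`.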

The modified version of the APPLYAND operation follows this
recursive structure, such that a recursive call generating node
$\mathbf{w}$ as the conjunction of nodes $\mathbf{u}$ and $\mathbf{v}$ also generates a proof
of the clause $[\overline{u} \lor \overline{v} \lor w]$, i.e., that $u \land v \to w$. We refer
to this clause as the justifying clause for the operation.
The recursive calls will have generated proofs of the clauses
$[\overline{u}_1 \lor \overline{v}_1 \lor w_1]$ and $[\overline{u}_0 \lor \overline{v}_0 \lor w_0]$. In general, the desired result
can require two RUP steps. The first generates a proof of the
intermediate result $x \to (u \land v \to w)$, given by clause
$[\overline{x} \lor \overline{u} \lor \overline{v} \lor w]$, using as antecedents the defining clauses HD($\mathbf{u}$), HD($\mathbf{v}$),
and HU($\mathbf{w}$), as well as the recursive result $[\overline{u}_1 \lor \overline{v}_1 \lor w_1]$.
The second step proves the target clause using as antecedents
the intermediate result, defining clauses LD($\mathbf{u}$), LD($\mathbf{v}$), and
LU($\mathbf{w}$), and the recursive result $[\overline{u}_0 \lor \overline{v}_0 \lor w_0]$. For special
cases, such as when some of the arguments are terminal nodes, 
only a subset of these antecedents is required. In some cases, 
the desired proof degenerates to a single proof step. The proof 
generation code in TBUDDY attempts to generate a single-step 
proof when one of the recursive results is a tautology. When 
this fails, or for the more general case, it generates a two-step 
proof. A built-in RUP checker determines which clauses to 
use as antecedents and can detect whether the proof succeeds 
or fails. The intermediate clause generated in a two-step proof 
can be deleted immediately after the second clause is added, 
and therefore there is a single justifying clause associated with 
each recursive operation. 
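
The two-step justification can be replayed on concrete variable numbers. The sketch below covers the general case in which all three nodes u, v, and w test the same variable x and all children are nonterminal; the antecedent lists follow the defining clauses of Table I. The helper `rup` and the function name `check_and_justification` are hypothetical, and literals are signed integers.

```python
def rup(target, antecedents):
    """Ordered RUP check: unit-propagate through antecedents until a conflict."""
    units = {-lit for lit in target}
    for clause in antecedents:
        remaining = [lit for lit in clause if -lit not in units]
        if not remaining:
            return True
        if len(remaining) != 1:
            return False
        units.add(remaining[0])
    return False


def check_and_justification(x, u, v, w, u1, v1, w1, u0, v0, w0):
    """Replay both RUP steps justifying [-u, -v, w]; returns (step1_ok, step2_ok)."""
    intermediate = [-x, -u, -v, w]            # x -> (u AND v -> w)
    step1 = rup(intermediate, [
        [-x, -u, u1],                         # HD(u)
        [-x, -v, v1],                         # HD(v)
        [-x, -w1, w],                         # HU(w)
        [-u1, -v1, w1],                       # recursive result, high branch
    ])
    step2 = rup([-u, -v, w], [
        intermediate,
        [x, -u, u0],                          # LD(u)
        [x, -v, v0],                          # LD(v)
        [x, -w0, w],                          # LU(w)
        [-u0, -v0, w0],                       # recursive result, low branch
    ])
    return step1, step2
```

Calling `check_and_justification(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)` confirms that both steps succeed with these antecedents in the order listed.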

Observe that to reuse results from the operation table, the 
program needs to reference its justifying clause. This requires 
augmenting the table entry with a field to hold an identifier 
for the justifying clause, as is discussed in Section III-A. 

The tbdd_validate operation enables an ordinary BDD
with root $\mathbf{v}$ to be upgraded to trusted node $\dot{\mathbf{v}}$ based on trusted
node $\dot{\mathbf{u}}$. When called, the program first generates a proof
of the implication $u \to v$, given by the clause $[\overline{u} \lor v]$.
It then uses a RUP step with this clause plus unit clause
$[u]$ to prove the unit clause $[v]$. The implication proof is
generated by PROVEIMPLICATION [15], an operation that
traverses the BDD and generates proof steps without adding
any nodes. At each step on arguments $\mathbf{u}'$ and $\mathbf{v}'$, it generates
a proof of the justifying clause $[\overline{u}' \lor v']$, i.e., that $u' \to v'$,
using a simplified version of the proof structure used for the
conjunction operation.

Some applications of TBDDs combine BDD and
clausal reasoning, alternating between the two forms.
The tbdd_validate_clause operation transfers the
trust embodied in TBDD node $\dot{\mathbf{u}}$ to a clause $C$, generating a
proof of $\mathrm{Def}(\mathbf{u}), u \models C$. This function requires TBUDDY to
generate a sequence of proof steps, concluding with a RUP
step with the specified clause. In some cases, the step can
be performed directly by tracing a path in the BDD from $\mathbf{u}$
down to node $T_0$ and listing some of the defining clauses
along the way as antecedents. In cases where the path is not
unique, the prover must first generate a BDD representation
$\mathbf{v}$ of the clause, validate $\mathbf{v}$, and then trace the path from $\mathbf{v}$ to
$T_0$.


E. Proof File Format 


There are several different file formats for encoding a DRAT 
proof, representing different trade-offs between the level of 
detail that must be supplied by the proof generator, versus 
the effort required to check the validity of the proof. With 
the LRAT format [28], each proof step must be accompanied
by a hint. For a RUP step, the hint specifies the sequence of
antecedent clauses. These proofs can be checked efficiently 
by the program LRAT-CHECK. There are also several formally 
verified checkers for LRAT proofs [28], [29]. By contrast, no 
hints are given with the DRAT format [2]. For each RUP step, 
the checker must identify a sequence of prior clauses that can 
serve as the antecedent. This format is accepted by the widely 
used DRAT-TRIM checker. Internally, DRAT-TRIM operates by 
adding the hints and then invoking an LRAT checker. The 
FRAT format [30] spans the two extremes of hints versus no 
hints by making the hints optional. It also operates by adding 
hints and invoking an LRAT checker. TBUDDY can generate 
proofs in any of these formats. Here we describe properties of 
the LRAT file format that influence how BUDDY encodes and 
stores proof information. Generating proofs in other formats 
requires storing additional information. For long executions, 
the proofs can range up to one billion clauses. These would 
be far too long for the DRAT-TRIM checker, due to the high cost 
of generating hints. In practice, therefore, it is best to either 
generate LRAT proofs or to generate FRAT proofs where the 
steps involving BDD operations include hints. 

Following the conventions of the DIMACS format for 
encoding CNF formulas, the proof clauses for a formula with 
n variables and m clauses are encoded using signed integers 
to represent literals, where variable $x_i$ is represented as the
value $i$, and its complement as $-i$. Each clause in the proof
is assigned a numeric clause ID, with the first m of these 
corresponding to the input clauses (which are not included in 
the proof file). Clause IDs must be in ascending order, but they 
need not be consecutive. Extension variables are represented 
by integers with values greater than n. RUP proof steps are 
encoded by giving the clause ID, the literals of the clause, and 
a list of the antecedent clause IDs. LRAT also supports clause 
deletion, where a list of clause IDs is provided, indicating 
that the proof will no longer use these clauses as antecedents. 
Deleting clauses whenever possible is critical for the proof 
checker, since it must retain copies of all active clauses, i.e., 
those that have been added but not yet deleted. 
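
As an illustration of the text form of these proof steps, the sketch below formats a RUP addition and a deletion in the line layout used by textual LRAT files: a clause ID, the literals terminated by 0, and the hint clause IDs terminated by 0, or, for a deletion, a step ID followed by `d` and the IDs to delete. The function names are hypothetical and this is not TBUDDY's output code.

```python
def lrat_add(clause_id, literals, hints):
    """Format a RUP step: '<id> <literals> 0 <antecedent ids> 0'."""
    fields = [clause_id, *literals, 0, *hints, 0]
    return " ".join(str(f) for f in fields)


def lrat_delete(step_id, clause_ids):
    """Format a deletion step: '<id> d <clause ids> 0'."""
    fields = [str(step_id), "d", *(str(c) for c in clause_ids), "0"]
    return " ".join(fields)


def literal(var_index, positive=True):
    """Signed-integer encoding: variable x_i is i, its complement is -i."""
    return var_index if positive else -var_index
```

For example, `lrat_add(7, [-1, -2, 3], [4, 5, 6])` produces the line `7 -1 -2 3 0 4 5 6 0`, and `lrat_delete(7, [4, 5])` produces `7 d 4 5 0`.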


III. IMPLEMENTATION 


With this as background, we can now describe how the
BUDDY BDD package was modified to support proof generation.
As we have seen, the key requirements are:

• Each time a new BDD node is created, it must be assigned
  an extension variable and its defining clauses must be
  added to the proof.

• For each input clause $C_i$, its BDD representation $\mathbf{u}_i$ must
  be generated, along with a proof of validating clause $[u_i]$.

• Every recursive step of the APPLYAND and
  PROVEIMPLICATION operations must generate one
  or two proof steps.

• The result nodes and proof steps generated by BDD
  operations must be stored for later reuse.

• A RUP step is required to prove validating clause $[u]$
  when BDD root $\mathbf{u}$ is generated by conjunction or
  implication testing.


• The defining clauses for the nodes and the clauses generated
  by RUP steps should be deleted when they are no
  longer required for subsequent proof steps.


These capabilities can all be incorporated into the basic BDD
operations, as well as the supporting operations to manage the
data structures.


[Figure 2 panels: (A) Node data, with fields lvl, rc, mk, low, high, next,
head, xvar, and dclause; (B) Cache entry, with fields including op, arg1,
arg2, arg3, and jclause; (C) TBDD, with fields including root and vclause.]

Fig. 2. Data structures for nodes (A), cache entries (B), and TBDDs (C).
Each rectangle represents four bytes. Proof generation requires adding the
fields shown in red.


A. Data Structures 


Figure 2(A) and (B) show the fields in the two major data 
structures for BUDDY, with added fields (shown in red) to 
support proof generation. It also shows the representation for 
a TBDD (C). A BDD node in BUDDY is indicated by an 
integer, providing an index into an array of node structures, 
each having the fields shown in (A). Nodes $T_0$ and $T_1$ are
represented by indices 0 and 1, respectively. Each rectangle in
the figure represents four bytes. The node array integrates the 
set of BDD nodes with the unique table, providing a mapping 
from the children and variable for each node to the node itself. 
In the node data structure (A), the fields indicated in gray 
encode the node. Three values are packed into the first four-byte
word: lvl, encoding the position of the node variable in
the BDD variable ordering, rc, a reference count used to track 
external references to the node, and mk, a single bit used to 
support mark-sweep garbage collection. The indices for the 
two children low and high occupy the second and third 
words. The fields shown in blue encode the unique table, with 
the next field forming a link in the linked list implementing 
a hash table bucket, and the head field providing the head of 
the linked list for all nodes that hash to this index. 
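
As a rough illustration of the packing described above, the following Python sketch stores lvl, rc, and mk in a single 32-bit word. The 21-bit width of lvl is the figure quoted for BUDDY in Section V; the 10-bit width of rc and the order of the fields within the word are assumptions made for this example, not details taken from the BUDDY source.

```python
LVL_BITS, RC_BITS, MK_BITS = 21, 10, 1   # rc width and field order are assumed

def pack_node_word(lvl, rc, mk):
    """Pack level, reference count, and mark bit into a single 32-bit word."""
    assert 0 <= lvl < (1 << LVL_BITS)
    assert 0 <= rc < (1 << RC_BITS)
    assert mk in (0, 1)
    return (mk << (LVL_BITS + RC_BITS)) | (rc << LVL_BITS) | lvl

def unpack_node_word(word):
    """Recover (lvl, rc, mk) from a packed word."""
    lvl = word & ((1 << LVL_BITS) - 1)
    rc = (word >> LVL_BITS) & ((1 << RC_BITS) - 1)
    mk = (word >> (LVL_BITS + RC_BITS)) & 1
    return lvl, rc, mk
```

With these widths the three fields fill the word exactly, which is why the level field cannot grow without shrinking the reference count.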

As mentioned earlier, to support dynamic variable ordering, 
BUDDY distinguishes between the level of a variable, giving 
its position in the BDD variable ordering, and the integer rep- 
resentation of the variable, with permutation vectors providing 
the mapping between these two. We use this feature to allow 
the BDD variable ordering to be independent of the numbering 
of variables in the input file. 

Supporting proof generation requires adding two fields to 
the node data structure. The xvar field gives the associated 
extension variable, encoded as an integer having a value 
greater than the number of input variables n. When a node 
is created, the next four clause IDs are assigned to its defining
clauses, even if only some subset of these is added to the proof. 
The dclause field stores the first of these—the remaining 
three can be computed as offsets from this field. In skipping 
some possible clause IDs, we add some sparseness to the 
ID space. Considering that we can only encode around two
billion ($2^{31} - 1$) clause IDs, and proofs can routinely reach
one billion clauses, this might seem wasteful. However, only 
a small fraction of the nodes in large BDDs will have terminal 
nodes as children, and so the vast majority of nodes will 
require the full complement of four defining clauses. 
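
The ID reservation scheme can be sketched in a few lines of Python. The class and function names below are hypothetical and chosen only for illustration; the sketch assumes nothing beyond what the text states, namely that each node reserves four consecutive IDs and records only the first in dclause.

```python
class ClauseIdAllocator:
    """Hands out consecutive clause IDs; each new node reserves a block of four."""
    MAX_ID = 2**31 - 1

    def __init__(self, first_free):
        self.next_id = first_free

    def reserve_defining_clauses(self):
        """Return the value to store in a new node's dclause field."""
        dclause = self.next_id
        if dclause + 3 > self.MAX_ID:
            raise OverflowError("clause ID space exhausted")
        self.next_id += 4          # IDs are reserved even if some clauses are never emitted
        return dclause

def defining_clause_ids(dclause):
    """The four IDs reserved for a node, computed as offsets from dclause."""
    return [dclause + k for k in range(4)]
```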

Like other BDD packages [22], BUDDY stores its table of 
previously computed results as a direct-mapped cache indexed 
by a hash of the operation and arguments.¹ Before performing
the recursive steps of an APPLY operation, the table is first 
referenced to see if a suitable result has already been gener- 
ated. When a new result is added to the table, any previous 
result that hashes to the same position is overwritten. The 
entries in the cache are shown in Figure 2(B). The standard 
entries (shown in gray) encode the operation, arguments (up 
to three), and the result node, each given as a four-byte 
integer. In the event the operation is either APPLYAND or 
PROVEIMPLICATION, reusing the cached result also requires 
the ID of the justifying clause. This is stored in the field 
jclause. 
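
A direct-mapped cache of this kind can be sketched as follows. This is a minimal Python illustration, not the TBUDDY implementation: the class name, the use of Python's built-in hash, and the convention of returning the evicted jclause to the caller are all assumptions.

```python
class DirectMappedCache:
    """One slot per hash index; a new entry overwrites whatever was there."""

    def __init__(self, size):
        self.slots = [None] * size

    def _index(self, op, args):
        return hash((op, args)) % len(self.slots)

    def lookup(self, op, args):
        """Return (result, jclause) on a hit, or None on a miss."""
        entry = self.slots[self._index(op, args)]
        if entry is not None and entry[0] == op and entry[1] == args:
            return entry[2], entry[3]
        return None

    def insert(self, op, args, result, jclause=None):
        """Store an entry and return the jclause of the entry it evicts, if any."""
        i = self._index(op, args)
        old = self.slots[i]
        self.slots[i] = (op, args, result, jclause)
        return old[3] if old is not None else None
```

Returning the evicted jclause gives the caller a natural point at which to schedule that clause for deletion from the proof.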

The added fields enable TBUDDY to track the clause IDs 
of the defining clauses for the active BDD nodes and the 
justifying clauses of the cache entries. Significantly, TBUDDY 
need not keep copies of the clauses themselves. When actual 
clauses are required to support proof generation, they can be 
recreated based on other information stored with the node or 
the cache entry. 

We can see that the node data structure expands from 20 
bytes to 28 in order to support proof generation. Cache entries 
require 24 bytes with or without proof generation, since an 
eight-byte field is used to store results for operations that 
return floating-point numbers. We configured the program to 
maintain a cache size that has 1/8 the number of entries as 
the node array. Therefore, adding proof generation required 
growing these two data structures from combined total of 23 
bytes per node to 31 bytes per node, an increase of 1.35x. 
These are the only two data structures that grow in proportion 
to the number of BDD nodes. 
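
The per-node figures follow directly from the sizes quoted above, with one 24-byte cache entry shared among every eight nodes:

```latex
\begin{align*}
\text{without proof generation:}\quad & 20 + \tfrac{24}{8} = 23 \text{ bytes per node},\\
\text{with proof generation:}\quad & 28 + \tfrac{24}{8} = 31 \text{ bytes per node},\\
\text{ratio:}\quad & \tfrac{31}{23} \approx 1.35 .
\end{align*}
```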

Figure 2(C) shows the representation of a TBDD. It consists 
of three integers. The first identifies the root node and the 
second gives the clause identifier for the validating clause. The 
third field, labeled rc_index, supports reference counting 
of TBDDs. This count is distinct from the reference count 
for the root node, since there may be references to a BDD 
node that are independent of its use in a TBDD. The reference 
count for a TBDD tracks references to possible uses of the 
validating clause in proof generation. Once the count drops 
to zero, the clause can be deleted. Since TBDD structures
are passed by value, they cannot hold actual reference counts.
Instead, a separate table of reference counts is maintained,
with the rc_index field providing an index into this table.
In typical applications, fewer than 1% of the BDD nodes serve
as TBDD roots, and so the space required by this table is
negligible.

¹The standard BUDDY package maintains seven separate caches to support
different operations. We combined these into a single, unified cache.
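
The separate reference-count table for TBDDs can be pictured with a small Python sketch. The names TbddRefTable, create, add_ref, and del_ref are hypothetical; only the three TBDD fields come from the text. The list named deleted stands in for emitting clause deletions to the proof.

```python
from collections import namedtuple

# A TBDD is three integers and is copied by value.
TBDD = namedtuple("TBDD", ["root", "vclause", "rc_index"])

class TbddRefTable:
    def __init__(self):
        self.counts = []     # counts[rc_index] = number of live references
        self.deleted = []    # validating clause IDs whose count reached zero

    def create(self, root, vclause):
        self.counts.append(1)
        return TBDD(root, vclause, len(self.counts) - 1)

    def add_ref(self, t):
        self.counts[t.rc_index] += 1
        return t

    def del_ref(self, t):
        self.counts[t.rc_index] -= 1
        if self.counts[t.rc_index] == 0:
            self.deleted.append(t.vclause)   # the validating clause is no longer needed
```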

As can be seen, the modifications to support proof gen- 
eration are fairly modest. In terms of code, the original 
BUDDY package contains 13,186 lines of source code. The 
TBUDDY package expands this to 18,030, with 1,061 lines 
added to existing files, 2,715 lines in new files to support 
proof generation and TBDDs, and 1,068 in new files to support 
parity reasoning. As noted above, the memory used increases 
by around 1.35x. The impact on runtime is more variable; we 
show experimental results in Section V. 


B. BDD Management 


BUDDY represents all of the nodes as a single array. This 
array starts with an initial allocation and is expanded as more 
nodes are added. Each expansion requires allocating a larger 
array, copying over existing nodes, and reconstructing the 
unique table and free list. Before expanding, it attempts to free 
existing nodes by performing garbage collection, reclaiming 
nodes that cannot be reached by any reference external to the 
data structure. Garbage collection is supported by 1) having 
each node store a reference count indicating the number of 
external references to the node, and 2) performing mark-sweep 
garbage collection to determine which nodes are unreachable. 
Nodes with nonzero reference counts provide the starting 
points of the marking phase. Both resizing the node array and 
performing garbage collection cause the entire cache to be 
flushed, with all entries marked as invalid. Garbage collection 
can occur at any point during the program operation, including 
in the middle of a series of recursive calls. To support this 
capability, a stack is maintained indicating intermediate nodes 
that may be required at future points in the outstanding calls. 
These nodes are also incorporated into the marking phase. 

Garbage collection and cache flushing provide the means to 
manage the active clauses in a proof. That is, when a node 
is reclaimed during the sweep phase, its defining clauses are 
deleted. When a cache entry is evicted, either because it is 
overwritten or the cache is flushed, its justifying clause is 
deleted. To support the ability to perform garbage collection in 
the middle of a sequence of recursive calls, the deletion steps 
are not added to the proof directly. Rather, they are added 
to a list, which is cleared as the top-level of the recursion 
completes. As mentioned earlier, the validating clauses for 
TBDDs are managed via a separate set of reference counts. 
The C++ interfaces to the package automatically handle the 
reference counting for both BDDs and TBDDs. 
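
The deferred handling of deletions can be sketched as below. This is an illustrative Python model under stated assumptions: the class DeletionLog and its methods are hypothetical, and proof steps are recorded as simple tuples rather than in any real proof format.

```python
class DeletionLog:
    def __init__(self):
        self.depth = 0       # current nesting depth of recursive BDD operations
        self.pending = []    # clause IDs waiting to be deleted
        self.proof = []      # emitted proof steps, here just ("delete", id)

    def enter(self):
        self.depth += 1

    def leave(self):
        self.depth -= 1
        if self.depth == 0:
            self._flush()    # top level of the recursion has completed

    def delete_clause(self, clause_id):
        self.pending.append(clause_id)
        if self.depth == 0:
            self._flush()

    def _flush(self):
        self.proof.extend(("delete", cid) for cid in self.pending)
        self.pending.clear()
```

Holding deletions back until the outermost call returns keeps any clause that an in-flight recursive step might still cite available in the proof.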


IV. CAPABILITIES SUPPORTED BY TBUDDY 


Building on the basic support for TBDDs, we have created 
several additional libraries and a BDD-based SAT solver. We 
describe these capabilities here and present some experimental 
results in Section V. 
A. Parity Reasoning


Parity constraints arise in a variety of contexts, but they are 
not well handled by current CDCL solvers. A parity constraint 
is an equation of the form: 


$$x_{i_1} \oplus x_{i_2} \oplus \cdots \oplus x_{i_k} = p \qquad (2)$$


The variables in this constraint are a subset of the input 
variables, and the phase $p$ is 1 for odd parity and 0 for even.
Adding two parity constraints creates a new parity constraint. 
Gaussian or Gauss-Jordan elimination systematically adds 
constraints to yield a reduced set [31]. It can determine when 
the set of constraints cannot be satisfied. When the constraints 
are satisfiable, it can be used to derive a satisfying assignment. 
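
Adding two parity constraints amounts to taking the symmetric difference of their variable sets and the exclusive-or of their phases. The Python sketch below uses a (set of variables, phase) pair as an assumed representation for illustration.

```python
def add_parity(c1, c2):
    """Sum two parity constraints, each given as (frozenset of variables, phase)."""
    vars1, p1 = c1
    vars2, p2 = c2
    # Variables appearing in both constraints cancel; phases add modulo 2.
    return (vars1 ^ vars2, p1 ^ p2)
```

For example, summing $x_1 \oplus x_2 = 1$ and $x_2 \oplus x_3 = 0$ gives $x_1 \oplus x_3 = 1$, while summing $x_1 \oplus x_2 = 1$ and $x_1 \oplus x_2 = 0$ gives the infeasible constraint $0 = 1$.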

Manipulating parity constraints is especially efficient for 
BDDs. The BDD representation of a constraint with $k$ variables
contains $2k + 1$ nodes, independent of the BDD variable
ordering. As we have demonstrated [17], a set of parity 
constraints encoded in CNF can be automatically extracted 
from an input formula, and BDD-based proofs of unsatis- 
fiability can be generated using Gaussian elimination. The 
TBUDDY package provides the necessary support for the proof 
generation portion of this task. 

Our constraint library represents a parity constraint as a 
list of integer variable IDs, a phase, and a TBDD giving the 
BDD representation of the constraint as well as the ID of a 
validating clause justifying that this constraint is implied by 
the input formula. An input constraint is converted into this 
representation by 1) forming the TBDD representations of the
input clauses that encode it, 2) conjoining them, and 3) using
this TBDD to validate a BDD representation of the constraint.
Each time constraints having TBDD representations $u$ and $v$
are summed to form a constraint with BDD representation
$t$, we use the conjunction operation to generate TBDD $w$
representing the conjunction of the constraints and validate
the sum by calling tbdd_validate(t, w).

Applying Gaussian elimination requires first running a 
preprocessor to identify how the clauses encode parity con- 
straints [17]. The program creates a schedule listing equations 
of the form of Equation 2 and identifying which clauses 
encode each of these. It also provides a list of the internal 
variables, i.e., those appearing only in parity constraints. Im- 
plicitly, all other variables are external. Gaussian elimination 
reduces the set of constraints to a smaller set over only the 
external variables. If the reduced set contains a constraint of 
the form 0 = 1, then the original set cannot be satisfied. 
Otherwise, any solution of the reduced set can be expanded 
into a solution of the original set. In either case, the reduced 
constraints have TBDD representations and can therefore be 
used in proof generation. 

Our Gaussian elimination routine attempts to preserve the 
sparseness found in typical parity constraint problems, where 
the number of variables in the constraints is far less than the 
total number of variables in the problem. Maintaining sparseness
requires a successful strategy for pivot selection. Consider
a set of parity constraints $P_1, P_2, \ldots, P_m$, each of the form of
Equation 2. Let the notation $x_j \in P_i$ indicate that constraint $P_i$
contains variable $x_j$. Each elimination step requires selecting
a pivot constraint $P_s$ and a pivot variable $x_t \in P_s$. It then
eliminates variable $x_t$ from all other constraints $P_i$ for which
$x_t \in P_i$ by replacing $P_i$ with the sum $P_i \oplus P_s$. Our routine uses
a greedy pivot selection strategy attributed to Markowitz [32],
[33]: Let $c_s$ be the number of nonzero variables in constraint
$P_s$ and $r_t$ be the number of constraints containing variable $x_t$.
Then a constraint $P_s$ and variable $x_t \in P_s$ are selected such
that the cost function $(c_s - 1) \cdot (r_t - 1)$ is minimized. That cost
is an upper bound on the net number of variables that will be
added to the constraints when generating the sums $P_i \oplus P_s$.
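
The elimination loop with this pivot rule can be sketched in Python as follows. This is a minimal illustration over explicit variable sets, with no BDDs and no proof generation. The function name and the (variables, phase) representation are assumptions, and ties are broken deterministically here rather than randomly.

```python
def gauss_eliminate(constraints, internal):
    """Eliminate the internal variables from a list of (variables, phase) constraints.

    Returns the reduced constraints over the external variables, or None when
    the infeasible constraint 0 = 1 is derived.
    """
    active = [(frozenset(vs), p) for vs, p in constraints]
    internal = frozenset(internal)
    while True:
        if any(not vs and p == 1 for vs, p in active):
            return None                                  # derived 0 = 1
        active = [(vs, p) for vs, p in active if vs]     # drop trivial 0 = 0
        r = {}                                           # r[t]: constraints containing x_t
        for vs, _ in active:
            for x in vs:
                r[x] = r.get(x, 0) + 1
        # Markowitz cost (c_s - 1) * (r_t - 1) for each candidate pivot (s, t).
        candidates = [((len(vs) - 1) * (r[t] - 1), s, t)
                      for s, (vs, _) in enumerate(active)
                      for t in sorted(vs & internal)]
        if not candidates:
            return active                                # only external variables remain
        _, s, t = min(candidates)
        pivot_vars, pivot_phase = active.pop(s)
        active = [(vs ^ pivot_vars, p ^ pivot_phase) if t in vs else (vs, p)
                  for vs, p in active]
```

With internal variables 10 and 11, the chain $x_1 \oplus x_{10} = 1$, $x_{10} \oplus x_{11} = 0$, $x_{11} \oplus x_2 = 1$ reduces to the single external constraint $x_1 \oplus x_2 = 0$.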


B. The TBSAT SAT Solver 


The TBSAT solver builds on the TBUDDY library. It can gen- 
erate multiple solutions for satisfiable formulas and proofs of 
unsatisfiability for unsatisfiable formulas. It starts by reading 
the input clauses and forming their TBDD representations. The 
overall control flow is determined by the combination of an 
optional input schedule file and bucket elimination, expanding 
on the capabilities implemented in our prototype solvers 
PGBDD [15] and PGPBS [17]. The schedule file can serve two 
different roles. In one, it specifies a sequence of conjunction 
and existential quantification operations using a stack-based 
notation. This mode can be effective when the user has some 
problem-dependent strategy for solving a particular problem. 
In the other form, it identifies sets of clauses forming parity 
constraints. These constraints are converted into TBDDs and 
simplified using Gaussian elimination. In some cases, a TBDD 
with root node $T_0$ will be generated while processing the
schedule file. That indicates the formula is unsatisfiable and 
the proof of unsatisfiability will be complete. Otherwise, the 
TBDDs remaining, including those of unused input clauses, 
are processed using bucket elimination. When no schedule file 
is provided, all clauses are processed in this manner. 

Bucket elimination [8], [9], [34] processes the TBDDs
according to some ordering of the variables. Our implementation
makes the simplifying assumption that buckets
are ordered according to the BDD variable ordering, with
bucket $i$ associated with input variable $x_i$. Each TBDD is
stored in a list (the “bucket”) according to its root node
variable. Buckets are processed from the least to the greatest.
For bucket $i$, a conjunction of the TBDDs in the bucket is
computed to yield TBDD $u_i$. A new BDD is computed as
$v_i = \mathrm{Low}(u_i) \lor \mathrm{High}(u_i)$, existentially quantifying $x_i$ from
$u_i$. This BDD is validated using TBDD $u_i$, since any Boolean
function $f$ and variable $x$ satisfy $f \rightarrow \exists x\, f$. The resulting
TBDD $v_i$ is then placed in the bucket corresponding to its
root node variable. This process continues until either 1) the
TBDD $T_0$ is generated, or 2) all buckets are processed with
the final step yielding $v_n = T_1$. In the former case, the formula
is unsatisfiable and the unsatisfiability proof is complete. In
the latter case, the formula is satisfiable and the next task is
to generate one or more solutions.

To generate a solution, the solver starts with an empty
assignment and works in reverse order, adding assignments
to variables $x_n$ through $x_1$. Let $\alpha_{n+1} = \emptyset$. For bucket $i$, the
solver can assume that $\alpha_{i+1}$ satisfies $v_i$, and it must assign a value
to $x_i$. Let $u_1 = \mathrm{High}(u_i)$ and $u_0 = \mathrm{Low}(u_i)$. Assignment $\alpha_{i+1}$
must satisfy at least one of these. In the event that just $u_1$ is
satisfied, assign 1 to $x_i$. If just $u_0$ is satisfied, then assign 0
to $x_i$. Otherwise, $x_i$ can be assigned an arbitrary value. No
further BDD generation is required to find a solution.
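
The forward elimination pass and the reverse solution pass can be illustrated with a small Python sketch. To stay self-contained it represents each Boolean function as an explicit set of satisfying assignments rather than as a BDD, so it is exponential in the number of variables and generates no proof. All function names are hypothetical, and variables are indexed from 0 internally.

```python
from itertools import product

def clause_models(clause, n):
    """Assignments (tuples of 0/1, index 0 = x_1) satisfying a DIMACS-style clause."""
    return frozenset(a for a in product((0, 1), repeat=n)
                     if any(a[abs(lit) - 1] == (lit > 0) for lit in clause))

def flip(a, i):
    return a[:i] + (1 - a[i],) + a[i + 1:]

def root_var(f, n):
    """First variable the function depends on, or None if it is constant."""
    for i in range(n):
        if any(flip(a, i) not in f for a in f):
            return i
    return None

def exists(f, i):
    """Existentially quantify variable i: the union of both cofactors."""
    return f | frozenset(flip(a, i) for a in f)

def bucket_eliminate(clauses, n):
    """Return the list of bucket conjunctions u, or None if the formula is unsatisfiable."""
    buckets = [[] for _ in range(n)]
    u = [None] * n

    def place(f):
        if not f:
            return False                 # constant false was generated
        i = root_var(f, n)
        if i is not None:
            buckets[i].append(f)
        return True

    for c in clauses:
        if not place(clause_models(c, n)):
            return None
    for i in range(n):
        if buckets[i]:
            u[i] = frozenset.intersection(*buckets[i])
            if not place(exists(u[i], i)):
                return None
    return u

def solve(clauses, n):
    """Return one satisfying assignment as a list of 0/1, or None."""
    u = bucket_eliminate(clauses, n)
    if u is None:
        return None
    a = [0] * n
    for i in reversed(range(n)):
        if u[i] is not None:
            a[i] = 1                     # try the high branch first (arbitrary choice)
            if tuple(a) not in u[i]:
                a[i] = 0                 # otherwise the low branch must be satisfied
    return a
```

The membership test in solve is valid because u[i] does not depend on any variable below i, so the placeholder zeros in the earlier positions do not affect it.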

To generate a solution where some of the variables have 
been eliminated by Gaussian elimination, the solver first con- 
tinues the elimination process to simplify the intermediate par- 
ity constraints via Gauss-Jordan elimination [31]. It uses BDD 
representations of these constraints to generate assignments for 
the internal variables. To generate multiple solutions, a new 
clause is created as the negation of the generated assignment, 
and the buckets are reprocessed in forward order. If this 
processing yields BDD node $T_0$, then no further solutions
exist. Otherwise, the bottom-up generation of an assignment 
will be guaranteed to find a new solution. 
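
The blocking-clause idea can be sketched as follows. In this Python illustration a brute-force search stands in for the BDD-based solver, and all names are hypothetical. The point is only how the negation of each found assignment is added as a new clause.

```python
from itertools import product

def blocking_clause(assignment):
    """DIMACS-style clause that is false exactly on `assignment` (list of 0/1)."""
    return [-(i + 1) if bit else (i + 1) for i, bit in enumerate(assignment)]

def satisfies(assignment, clause):
    return any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)

def find_model(clauses, n):
    """Stand-in for the solver: brute-force search for one satisfying assignment."""
    for a in product((0, 1), repeat=n):
        if all(satisfies(a, c) for c in clauses):
            return list(a)
    return None

def all_solutions(clauses, n):
    """Enumerate solutions by repeatedly blocking the one just found."""
    clauses = list(clauses)
    found = []
    while (a := find_model(clauses, n)) is not None:
        found.append(a)
        clauses.append(blocking_clause(a))
    return found
```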


V. EXPERIMENTAL EVALUATION 


As a general purpose SAT solver, TBSAT is no match for 
state-of-the-art CDCL solvers. Among benchmarks used in 
recent SAT competitions, it succeeds only on the TSEITIN- 
GRID parity constraint problems [35]. On the other hand, it 
handles classes of problems for which CDCL solvers fare 
poorly. BDD-based approaches can best complement CDCL, 
rather than compete with it. 

Table II shows the performance of proof-generating SAT 
solvers on several scalable, unsatisfiable challenge problems. 
It compares different operating modes of TBSAT to KISSAT, 
a state-of-the-art CDCL solver [36]. It shows a progression 
of problem sizes, with the most difficult benchmark for 
one approach becoming the starting point for the next. All 
experiments were performed on a 3.2 GHz Apple M1 Max 
processor with 64 GB of memory and running the OS X 
operating system. The proofs were checked using DRAT-TRIM 
for the proofs generated by KISSAT and LRAT-CHECK for 
those generated by TBSAT. For LRAT proofs over 500 million 
clauses, we used a modified version of LRAT-CHECK that 
better exploits the sparseness in the proof structure that arises 
when a large fraction of the clauses is deleted. The column 
labeled “SAT Time” indicates the time (in seconds) taken by 
the solver, and the column labeled “Check Time” indicates the 
time taken by the checker. The column labeled “Proof Clauses” 
indicates the number of clauses in the generated proof. Entries 
marked “—” indicate a failure by the program to complete. 
The following benchmark problems were evaluated: 

• Mutilated chessboard: Tile an $n \times n$ chessboard with
dominos. Two opposite corners are removed from the
chessboard, making the task impossible [37]. The problem
size, in terms of the number of variables and clauses,
scales as $O(n^2)$.

• Pigeonhole: Assign $n+1$ pigeons to $n$ holes such that no
hole contains more than one pigeon [38]. The at-most-one
constraints are encoded using auxiliary variables [39].
The problem size scales as $O(n^2)$.


TABLE II
PERFORMANCE OF KISSAT AND TBSAT ON UNSATISFIABLE CHALLENGE PROBLEMS

Solver  Method        Problem Size  Variables    Clauses  SAT Time  Proof Clauses  Check Time

Mutilated Chessboard
KISSAT  CDCL                    16        476      1,592     358.7     12,621,694       618.5
KISSAT  CDCL                    18        608      2,044    1314.9     38,083,824      1295.8
TBSAT   Column scan             18        608      2,044       0.1        111,163         0.1
TBSAT   Column scan            368    270,108    943,544     898.2    568,261,363       568.8

Pigeonhole
KISSAT  CDCL                    13        351        508    1116.1     66,263,560      2041.8
KISSAT  CDCL                    14        406        589    6077.2    331,858,919           —
TBSAT   Column scan             14        406        589       0.1         92,687         0.1
TBSAT   Column scan            254    129,286    193,549     898.5    898,819,648       993.5

Chew-Heule parity formulas
KISSAT  CDCL                    40        114        304     334.3     29,133,644       594.2
KISSAT  CDCL                    44        126        336    3103.6    227,489,490      8254.9
TBSAT   Bucket elim.            44        126        336       0.1         24,492         0.1
TBSAT   Bucket elim.         8,666     25,992     69,312     894.7    505,637,209       523.4
TBSAT   Gauss. elim.         8,666     25,992     69,312       4.6      5,066,914          52
TBSAT   Gauss. elim.       699,051  2,097,147  5,592,392     645.3    575,600,179       656.1

Urquhart-Li parity formulas
KISSAT  CDCL                     3        153        408         —              —           —
TBSAT   Bucket elim.             3        153        408       0.1         38,598         0.1
TBSAT   Bucket elim.            35     25,305     67,480     784.6    349,400,890       230.8
TBSAT   Gauss. elim.            35     25,305     67,480       3.8      4,232,657         4.3
TBSAT   Gauss. elim.           316  2,093,184  5,581,824     529.3    484,548,938       346.9


• Chew-Heule: Enforce both odd and even parity constraints
on the $n$ input variables. Each constraint is
encoded linearly using $n - 1$ auxiliary variables, with
the second constraint using a random permutation of the
variables [40]. The problem size scales as $O(n)$.

• Urquhart-Li: A parity constraint problem devised by
Urquhart [41], defined over a bipartite graph with $2m^2$
nodes. The problem size scales as $O(m^2)$. We use the
benchmark generator implemented by Li [42].
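
To make the pigeonhole encoding concrete, the Python sketch below generates the clauses using a sequential-counter scheme for the at-most-one constraints. The choice of that scheme is an assumption, since the text only cites [39] for the encoding. It does, however, reproduce the variable and clause counts listed for the pigeonhole rows of Table II.

```python
def at_most_one(xs, new_var):
    """Sequential-counter at-most-one over variables xs (len(xs) >= 2).

    Uses len(xs) - 1 auxiliary variables obtained from new_var().
    """
    k = len(xs)
    s = [new_var() for _ in range(k - 1)]
    clauses = [[-xs[0], s[0]]]
    for i in range(1, k - 1):
        clauses += [[-xs[i], s[i]], [-s[i - 1], s[i]], [-xs[i], -s[i - 1]]]
    clauses.append([-xs[k - 1], -s[k - 2]])
    return clauses

def pigeonhole(n):
    """Return (number of variables, clauses) for n + 1 pigeons and n holes."""
    counter = [0]

    def new_var():
        counter[0] += 1
        return counter[0]

    # p[i][j] is true when pigeon i is placed in hole j.
    p = [[new_var() for _ in range(n)] for _ in range(n + 1)]
    clauses = [list(row) for row in p]          # every pigeon is in some hole
    for j in range(n):                          # no hole holds two pigeons
        clauses += at_most_one([p[i][j] for i in range(n + 1)], new_var)
    return counter[0], clauses
```

For $n = 14$ this yields 406 variables and 589 clauses, matching the corresponding table entries.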


The formulas were evaluated for different values of the scaling 
parameter n or m. Runs of TBSAT were limited to 900 
seconds—longer runs generally produced proofs that exceeded 
the capacity of the proof checker. KISSAT was allowed to run 
for up to 7200 seconds. 

The limitations of CDCL solvers for these problems are 
clearly indicated by the results for KISSAT. It can only 
handle relatively small instances. We also found that allowing 
longer run times does not have a significant effect, due to 
the exponential scaling. For example, KISSAT completes the 
mutilated chessboard problem for n = 16 in 360 seconds, 
but once it reaches n = 20, the solver runs for over two 
hours without completing. Similarly, KISSAT completes the 
pigeonhole problem for n = 12 in just 42 seconds, but once 
it reaches n = 14, it requires nearly 1.7 hours and generates 
a proof that is too large for DRAT-TRIM to check. For the 
Chew-Heule formulas, KISSAT can only complete $n \le 44$
within the 7200-second time limit. We ran KISSAT for over 16 
hours on the smallest instance of the Urquhart-Li benchmark, 
having m = 3, but it did not complete. It is remarkable that 
a problem with just 153 variables and 408 clauses could be 
so challenging for CDCL solvers. 


By contrast, TBSAT achieves polynomial scaling for all 
four benchmarks. In earlier work [15], we presented column 
scanning to efficiently generate unsatisfiability proofs of the 
mutilated chessboard and pigeonhole problems. This approach 
performs a sequence of conjunction and quantification steps 
to effectively sweep through the columns of the chessboard or 
the pigeons in the pigeonhole problem in a manner inspired 
by symbolic model checking. TBSAT can also apply column 
scanning, easily handling the limiting instances for KISSAT. It 
can scale to n = 368 for the mutilated chessboard problem 
and to n = 254 for the pigeonhole problem within the 900- 
second time limit. Even though the generated proofs are very 
large, they can be verified by the modified version of LRAT- 
CHECK. It remains to be seen whether column scanning can 
be made more general and with automatic generation of the 
schedule and variable order. 


TBSAT can apply bucket elimination to the two parity 
problems with good effect. It can easily handle the limit- 
ing instances for KISSAT, and it scales to the Chew-Heule 
benchmark for $n \le 8666$ and the Urquhart-Li benchmark for
$m \le 35$ within a 900-second time limit.


Perhaps the most striking results are those using Gaussian
elimination. By exploiting the sparse structure of the formulas,
TBSAT can solve very large instances of the Chew-Heule and
Urquhart benchmarks quickly. The limiting factor for both of
these problems is that BUDDY allocates only 21 bits for the
level field in each BDD node (Figure 2(A)), limiting it
to a maximum of $2^{21} - 1$ (2,097,151) input variables. This
prevents it from going beyond n = 699,051 for Chew-Heule
and m = 316 for Urquhart, each having over two million input
variables and five million clauses. Obtaining these results
requires no guidance from the user, and it is insensitive to the
BDD variable ordering.

Fig. 3. Elapsed times (in seconds) for different solvers and solution methods on Chew-Heule parity formulas, as a function of problem size n.

Figure 3 presents more runtime data for the Chew-Heule 
parity formula benchmark as a function of problem size n, 
enabling us to compare the relative performance and scaling 
of different solvers and solution methods. The red lines show 
three different versions of solving via bucket elimination. 
The top red line shows the performance of our prototype 
solver PGBDD, while the middle line shows the times for 
TBSAT. As can be seen, TBSAT consistently ran 10-12x faster.
This can be attributed to the advantage of compiled C/C++ 
code versus interpreted Python. The lower red line shows the 
performance of TBSAT when proof generation is not required. 
This mode performs only the conjunction and quantification 
BDD operations, without generating proof clauses or writing 
them to a file. For smaller values of n, the runtime can be up 
to 33x faster, but this advantage drops to just a factor of 2x 
for larger values. For large values of n, the cost of garbage 
collection becomes a more dominant concern. 

The data shown in green give results for three different 
versions of solving via Gaussian elimination. The data points 
at the top show the performance of our prototype pseudo- 
Boolean solver PGPBS. We found that the runtimes and 
generated proof sizes varied widely depending on the random 
permutation of the second parity constraint, and so the plot 
shows the raw data for five different random seeds for each 
value of n, including timeouts. The variation depends on 
whether or not the greedy pivot selections kept the constraints 
sparse. The middle green line shows the performance of TBSAT 
using Gaussian elimination. As noted before, it scales very 
well, nearly reaching its upper limit of n = 699,051 within 
the 600-second time limit. Compared to even the best data 
points for PGPBS, we see that TBSAT achieves much better 
scaling despite using very similar algorithms. However, like
PGPBS, its ability to maintain sparseness depends on both 
the particular permutation of the second parity constraint, as 
well as the random tie breaking done during pivot selection. 
Consequently, some data points yielded timeouts. The lower 
green line shows the performance of TBSAT using Gaussian 
elimination, but without proof generation. In this mode, it 
need not perform any BDD operations and hence can be very 
fast, reaching a maximum of 15.3x faster for n = 3,000, but 
dropping off to 6.2x as n approaches its limiting value. 

Overall, these measurements show that 1) TBSAT greatly 
outperforms the prototype implementations, 2) adding proof 
generation can slow performance considerably, but the penalty 
diminishes for larger benchmarks, 3) Gaussian elimination 
greatly increases the speed and capacity of the solver for parity 
constraint problems, and 4) careful pivot selection is required 
to maintain sparseness during Gaussian elimination. 


VI. CONCLUSIONS AND ACKNOWLEDGEMENTS 


The TBUDDY library provides a powerful framework for 
creating automated reasoning tools that generate proofs of 
correctness. Building on an established BDD package, it can 
generate clausal proofs justifying the correctness of each step 
in its recursive algorithms. The TBSAT solver is especially 
strong for handling problems with parity constraints. We 
have also incorporated its proof-generation capability into a 
CDCL solver that uses Gauss-Jordan elimination for parity 
reasoning [43]. We anticipate implementing other automated 
reasoning tools using TBUDDY. 

Thanks to Marijn Heule for his continued advice and for 
creating a high capacity version of LRAT-CHECK. This work 
was supported by the U. S. National Science Foundation under 
grant CCF-2108521. 

Abstract—Our recently proposed certification framework for
bit-level k-induction-based model checking has been shown to 
be quite effective in increasing the trust of verification results 
even though it partially involved quantifier reasoning. In this 
paper we show how to simplify the approach by assuming reset 
functions to be stratified. This way it can be lifted to word-level 
and in principle to other theories where quantifier reasoning is 
difficult. Our new method requires six simple SAT checks and 
one polynomial-time check, allowing certification to remain in 
co-NP while the previous approach required five SAT checks 
and one QBF check. Experimental results show a substantial 
performance gain for our new approach. Finally we present and 
evaluate our new tool CERTIFAIGER-WL which is able to certify 
k-induction-based word-level model checking. 


I. INTRODUCTION 


Over the past several years, there has been growing interest 
in system verification using word-level reasoning. Satisfiability 
Modulo Theories (SMT) solvers for the theory of fixed- 
size bit-vectors are widely used for word-level reasoning [1], 
[2]. For example, word-level model checking has been an 
important part of the hardware model checking competitions 
since 2019. Given the theoretical and practical importance of 
word-level verification, a generic certification framework for it 
is necessary. As quantifiers in combination with bit-vectors are 
challenging for SMT solvers and various works have focused 
on eliminating quantifiers in SMT [2]-[4], a main goal of this 
paper is to generate certificates without quantification. 

Temporal induction (also known as k-induction) [5] is a 
well-known model checking technique for verifying software 
and hardware systems. An attractive feature of k-induction 
is that it is natural to integrate it with modern SAT/SMT 
solvers, making it popular in both bit-level model checking 
and beyond [6]-[8], including word-level model checking. 

Certification helps gain confidence in model checking
results, which is important for both safety- and business- 
critical applications. There have been several contributions 
focusing on generating proofs for SAT-based model checking 
[9]-[15]. For example, [16] and [14] proposed an approach to
certify LTL properties and a few preprocessing techniques by 
generating deductive proofs. In this paper, we focus on finding 
an inductive invariant for k-induction. Unlike other SAT/SMT- 
based techniques such as IC3 [17] and interpolation [18], 
[19], k-induction does not automatically generate an inductive
invariant that can be used as a certificate [20]. In previous
research [21], certification of k-induction was achieved via
five SAT checks together with a one-alternation QBF check,
redirecting the certification problem to verifying an inductive 
invariant in an extended model that simulates the original one. 

At the heart of the present contribution is the idea of 
reducing the certification method of k-induction to pure SAT 
checks, i.e., eliminating the quantifiers. This enables us to 
complete the certification procedure at a lower complexity, and 
to directly apply the framework to word-level certification. We 
introduce the notion of stratified simulation which allows us 
to reason about the simulation relation between two systems. 

This stratified simulation relation can be verified by three
SAT checks and one polynomial-time check. The latter checks whether
the reset function is indeed stratified. In addition, we present a
witness circuit construction which simulates the original under
the stratified simulation relation, thus creating a simpler and
more elegant certification construction for k-induction.

While the previous work only focused on bit-level model 
checking, we also lift our method to word-level by imple- 
menting a complete toolsuite CERTIFAIGER-WL, where the 
experiments show the practicality and effectiveness of our 
certification method for word-level models. 


II. BACKGROUND 


This paper extends previous work in certification for k- 
induction-based bit-level model checking [21]. In this section, 
we present essential concepts and notations. 

For the sake of simplicity we work with functions rep- 
resented as interpreted terms and formulas over fixed but 
arbitrary theories which include an equality predicate. We 
further assume a finite sorted set of variables $L$ where each
variable $l \in L$ is associated with a finite domain of possible
values. We also include Boolean variables as variables with a
domain of $\{\top, \bot\}$, for which we keep standard notations.

For two sets of variables $J$ and $L$, we also write $J, L$
to denote their union. Given two functions $f(V)$, $g(V')$
where $V \subseteq V'$ (represented as interpreted terms over our
fixed but arbitrary theories) we call them equivalent, written
$f(V) \equiv g(V')$, if for every assignment to variables in $V$
and $V'$ that matches on the shared set of variables $V$, the
functions $f(V)$, $g(V')$ have the same values. Additionally, we
use “$\simeq$” for syntactic equivalence [22], “$\rightarrow$” for syntactic
implication, and “$\Rightarrow$” for semantic implication. To define
semantical concepts or abbreviations we stick to equality “=”.
We use vars(f) to denote the set of variables occurring in the
syntactic representation of a function f.

Fig. 1: An outline of the certification approach. Given some
value of $k$ and a model $C$, $C'$ is the resulting witness circuit.
The coloured area is specific to our approach for k-induction,
and the rest corresponds to the general certification flow.

In word-level model checking operations are applied to
fixed-size bit-vectors. We introduce the notion of word-level
circuits where we model inputs and latches as finite-domain
variables.

In word-level model checking, operations are applied to
fixed-size bit-vectors. We introduce the notion of word-level 
circuits where we model inputs and latches as finite-domain 
variables. 


Definition 1 (Circuit). A circuit is a tuple $C = (I, L, R, F, P)$
such that:
• $I$ is a finite set of input variables.
• $L$ is a finite set of latch variables.
• $R = \{r_l(L) \mid l \in L\}$ is a set of reset functions.
• $F = \{f_l(I, L) \mid l \in L\}$ is a set of transition functions.
• $P(I, L)$ is a function that evaluates to a Boolean output,
  encoding the (good states) property.


By Def. 1 a circuit represents a hardware system in a fully
symbolic form. In order to talk about the reset functions of a
subset of latches $L'' \subseteq L$, we also write

$$R(L'') = \bigwedge_{l \in L''} (l \simeq r_l(L)).$$

The following four definitions are adapted from our previous 
work [21] for completeness of exposition. 


Definition 2 (Unrolling). For an unrolling depth $m \in \mathbb{N}$,
the unrolling of a circuit $C = (I, L, R, F, P)$ of length $m$
is defined as

$$U_m = \bigwedge_{i \in [0,m)} (L_{i+1} \simeq F(I_i, L_i)).$$


Definition 3 (Inductive invariant). Given a circuit $C$ with a
property $P$, $\phi(I, L)$ is an inductive invariant in $C$ if and only
if the following conditions hold:

1) $R(L) \Rightarrow \phi(I, L)$, “initiation”
2) $\phi(I, L) \Rightarrow P(I, L)$, and “consistency”
3) $U_1 \wedge \phi(I_0, L_0) \Rightarrow \phi(I_1, L_1)$. “consecution”


As a generalisation of the notion of an inductive invariant, 
k-induction checks k steps of unrolling instead of 1. In the 
following, to verify that a property is an inductive invariant, 
we consider it as the special case of k-induction with k = 1 
and $\phi(I, L) = P(I, L)$.


Definition 4 (k-induction). Given a circuit $C$ with a property
$P$, $P$ is called k-inductive in $C$ if and only if the following
two conditions hold:

1) $U_{k-1} \wedge R(L_0) \Rightarrow \bigwedge_{i \in [0,k)} P(I_i, L_i)$, and “BMC”
2) $U_k \wedge \bigwedge_{i \in [0,k)} P(I_i, L_i) \Rightarrow P(I_k, L_k)$. “consecution”
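
To make Def. 4 concrete, the following Python sketch checks both conditions by explicit enumeration instead of a SAT solver. It is an illustration under our own assumptions only: all variables are Boolean, the helper name `is_k_inductive` and the two-latch example circuit are hypothetical, and the enumeration is exponential in the number of latches.

```python
from itertools import product

def is_k_inductive(inputs, latches, reset, trans, prop, k):
    """Brute-force check of the two conditions of Def. 4 for Boolean variables."""
    all_I = [dict(zip(inputs, v)) for v in product([False, True], repeat=len(inputs))]
    all_L = [dict(zip(latches, v)) for v in product([False, True], repeat=len(latches))]

    def paths(starts, n):
        """Yield all sequences [(I_0, L_0), ..., (I_{n-1}, L_{n-1})] obeying the unrolling."""
        def extend(prefix):
            if len(prefix) == n:
                yield prefix
                return
            I_prev, L_prev = prefix[-1]
            L_next = {l: trans[l](I_prev, L_prev) for l in latches}
            for I in all_I:
                yield from extend(prefix + [(I, L_next)])
        for L0 in starts:
            for I0 in all_I:
                yield from extend([(I0, L0)])

    # Reset states: every latch equals the value of its reset function.
    init = [L for L in all_L if all(L[l] == reset[l](L) for l in latches)]
    # "BMC": P holds on the first k states of every path from a reset state.
    bmc = all(prop(I, L) for path in paths(init, k) for (I, L) in path)
    # "consecution": k consecutive P-states are always followed by a P-state.
    consecution = all(prop(*path[k]) for path in paths(all_L, k + 1)
                      if all(prop(I, L) for (I, L) in path[:k]))
    return bmc and consecution

# Hypothetical 2-latch circuit: states 0 <-> 1 are reachable, 2 -> 3 -> 3 is not.
latches = ["hi", "lo"]
reset = {"hi": lambda L: False, "lo": lambda L: False}
trans = {"hi": lambda I, L: L["hi"],
         "lo": lambda I, L: (not L["lo"]) or L["hi"]}
prop = lambda I, L: not (L["hi"] and L["lo"])   # "state 3 is never reached"

print(is_k_inductive([], latches, reset, trans, prop, 1))  # False
print(is_k_inductive([], latches, reset, trans, prop, 2))  # True
```

The property fails for k = 1 because the unreachable state 2 satisfies P but steps to state 3; it holds for k = 2 because state 2 has no predecessor, so no two consecutive P-states lead into state 3.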


Definition 5 (Combinational extension).
A circuit $C' = (I', L', R', F', P')$ combinationally extends a
circuit $C = (I, L, R, F, P)$ if $I = I'$ and $L \subseteq L'$.


III. CERTIFICATION 


In this section we introduce and formalise our certification 
approach which reduces the certification problem to six SAT 
checks and one polynomial stratification check. 

The certification approach is outlined in Fig. 1. Intuitively, 
a witness circuit is generated from a given value of k (pro- 
vided by the model checker) and a model (either bit-level or 
word-level). The witness circuit simulates the original circuit 
while allowing more behaviours (we formally define it as the 
stratified simulation relation). In practice, the witness circuit 
would be required to be provided by model checkers as the 
certificate in hardware model checking competitions. 

We also perform a polynomial-time stratification check on 
the witness circuit. The check requires that the definition of the 
reset function is stratified, i.e., no cyclic dependencies between 
the reset definitions of the variables exist. This is the case for 
all hardware model checking competition benchmarks. Even 
though cyclic definitions have been the subject of study in 
several papers [23]-[25], they are usually avoided due to the 
complexity of their analysis and subtle effects on semantics. 

The approach in [21] can handle cyclic resets, but only at the
cost of QBF quantification, and thus cannot be efficiently
adapted to the context of word-level verification.
Furthermore, the witness circuit includes an inductive invariant
which serves as a proof certificate and is verified by another
three SAT checks as defined in Def. 3.

We begin by defining stratified reset functions. 


Definition 6. (Dependency graph.) Given a set of latches $L$
and a set of reset functions $R = \{r_l \mid l \in L\}$, the dependency
graph $G_R$ has latch variables $L$ as nodes and contains a
directed edge $(a, b)$ from $a$ to $b$ iff $a \in vars(r_b)$ and $r_b \not\simeq b$.


Latches with undefined reset value are common in applica-
tions. We simply set $r_b = b$ for some uninitialised latch $b$ in
such a case (as in AIGER and BTOR) to avoid being required
to reason about ternary logic or partial functions. Thus the
syntactic condition “$r_b \not\simeq b$” in the last definition simply avoids
spurious self-loops in the dependency graph for latches with
undefined reset values.


Definition 7. (Stratified resets.) Given a set of latches $L$ and
a set of reset functions $R = \{r_l \mid l \in L\}$, $R$ is said to be
stratified iff $G_R$ is acyclic.
TABLE I: Summary of certification results for the bit-level TIP suite.

                   φ_init        φ_consist       φ_consec         φ_trans        φ_prop         φ_reset
Benchmarks        t1     t2      t1     t2       t1      t2      t1     t2      t1     t2      t1     t2
c.periodic       7.78   0.06    0.06   0.06    56.82   56.29    0.15   0.14    0.05   0.05   84.04   0.00
n.guidance_1     0.19   0.01    0.01   0.01     3.73    3.79    0.12   0.12    0.01   0.01    1.21   0.00
n.guidance_7     4.09   0.02    0.02   0.02    18.40   18.17    0.12   0.12    0.02   0.02   25.22   0.00
n.tcasp_2        0.17   0.01    0.01   0.01     2.64    2.68    0.23   0.23    0.01   0.02    1.79   0.00
n.tcasp_3        0.11   0.01    0.01   0.01     1.82    1.70    0.23   0.26    0.02   0.02    1.01   0.00
v.prodcell_12    2.35   0.03    0.03   0.03    59.05   59.22    0.12   0.12    0.03   0.03    8.48   0.00
v.prodcell_13    0.22   0.01    0.01   0.01     2.99    2.99    0.12   0.12    0.01   0.01    0.20   0.00
v.prodcell_14    0.64   0.02    0.02   0.02    13.69   13.69    0.12   0.12    0.02   0.02    1.45   0.00
v.prodcell_15    2.22   0.02    0.03   0.03    32.66   32.28    0.12   0.12    0.02   0.02    2.26   0.00
v.prodcell_16    0.01   0.01    0.01   0.01     1.19    1.20    0.12   0.12    0.01   0.01    0.06   0.00
v.prodcell_17    2.34   0.03    0.03   0.03    48.51   48.17    0.12   0.12    0.03   0.03    6.86   0.00
v.prodcell_18    0.67   0.01    0.01   0.01     8.67    8.78    0.12   0.12    0.02   0.02    0.79   0.00
v.prodcell_19    1.66   0.02    0.02   0.03    31.98   31.78    0.12   0.12    0.03   0.03    3.73   0.00
v.prodcell_24    3.32   0.04    0.04   0.04   112.12  115.18    0.12   0.12    0.04   0.04   17.64   0.00

Columns report the benchmark names, and the time (in seconds) used for each SAT check by CERTIFAIGER (t1) and CERTIFAIGER++ (t2) respectively.
Interestingly, the SAT solving time for the new reset check is close to zero, which checks the equality of the reset functions between the shared set of
latches and the latches in the original circuit. This is because all latches in the benchmark set are initialized to $\bot$, thus making the SAT checks rather trivial.


Definition 8. (Stratified circuit.) A circuit C = (I, L, R, F, P) 
is said to be stratified iff R is stratified. 


The stratification check can be done in polynomial time 
using Def. 7 and it is enforced syntactically in the two 
hardware description formats AIGER and BTOR2. 
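
The polynomial-time stratification check can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the check implemented in the toolkits: reset terms are encoded as nested tuples `(op, arg, ...)` with strings as latch names, and the function names are hypothetical.

```python
def term_vars(term):
    """Variables of a term: a str is a variable, a tuple is (op, arg, ...), anything else a constant."""
    if isinstance(term, str):
        return {term}
    if isinstance(term, tuple):
        return set().union(*(term_vars(arg) for arg in term[1:]))
    return set()

def dependency_graph(resets):
    """Map each latch b to the latches a with an edge (a, b): a occurs in r_b and r_b is not b itself."""
    graph = {}
    for b, r_b in resets.items():
        # An uninitialised latch has r_b syntactically equal to b: no self-loop is added.
        graph[b] = set() if r_b == b else term_vars(r_b) & set(resets)
    return graph

def is_stratified(resets):
    """Acyclicity check: repeatedly remove latches whose dependencies are already removed."""
    pending = dependency_graph(resets)
    while pending:
        ready = [b for b, deps in pending.items() if not (deps & set(pending))]
        if not ready:          # every remaining latch waits on another one: a cycle
            return False
        for b in ready:
            del pending[b]
    return True

# z is uninitialised (r_z = z), y depends on x, x is a constant reset.
stratified = {"x": False, "y": ("not", "x"), "z": "z"}
# x and y depend on each other.
cyclic = {"x": ("not", "y"), "y": ("and", "x", "z"), "z": "z"}

print(is_stratified(stratified), is_stratified(cyclic))
```

Each round removes at least one latch or reports a cycle, so the loop runs at most |L| times.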


Definition 9. (Stratified simulation.) Given two stratified
circuits $C$ and $C'$, where $C'$ combinationally extends $C$, there
is a stratified simulation between $C'$ and $C$ iff

1) $r_l(L) \equiv r'_l(L')$ for $l \in L$, “reset”
2) $f_l(I, L) \equiv f'_l(I, L')$ for $l \in L$, and “transition”
3) $P'(I, L') \Rightarrow P(I, L)$. “property”


In essence, the crucial change here compared to the combi- 
national simulation definition in [21] is the reset condition, 
whose simplification was possible under the stratification 
assumption. The above three conditions are encoded into
SAT/SMT formulas ($\phi_{reset}$, $\phi_{trans}$, $\phi_{prop}$ in Fig. 1) which
are then checked by a solver for validity. In the rest of the
paper, we simply refer to the stratified simulation relation as 
simulation relation. Proofs of the presented theoretical results 
can be found in an extended version of this paper [26]. 


Theorem 1. Given two circuits $C$ and $C'$, where $C'$ simulates
$C$, if $C'$ is safe, then $C$ is also safe.


Next, we introduce the witness circuit construction. This is 
similar to the construction in [21] but differs in several details, 
e.g., the reset function definition is stratified and significantly 
simplified compared to [21]. 


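Definition 10. (Witness circuit.) Given a circuit $C =
(I, L, R, F, P)$ and an integer $k \in \mathbb{N}^+$, its witness circuit
$C' = (I', L', R', F', P')$ is defined as follows:
1) $I' = I$ (also referred to as $X^{k-1}$),
2) $L' = L^{k-1} \cup \cdots \cup L^0 \cup X^{k-2} \cup \cdots \cup X^0 \cup B$ where,
   • $L^{k-1} = L$, the other variable sets are copies of $I$
     and $L$ respectively with the same variable domains.
   • $B = \{b^{k-1}, \ldots, b^0\}$ are Booleans.
3) $R'$:
   • for $l \in L^{k-1}$, $r'_l = r_l(L^{k-1})$.
   • for $l \in L^0 \cup \cdots \cup L^{k-2} \cup X^0 \cup \cdots \cup X^{k-2}$, $r'_l = l$.
   • $r'_{b^{k-1}} = \top$.
   • for $i \in [0, k-1)$, $r'_{b^i} = \bot$.
4) $F'$:
   • for $l \in L^{k-1}$, $f'_l = f_l(I, L^{k-1})$.
   • $f'_{b^{k-1}} = b^{k-1}$,
   • for $i \in [0, k-1)$, $l^i \in (L^i \cup X^i \cup \{b^i\})$, $f'_{l^i} = l^{i+1}$.
5) $P' = \bigwedge_{i \in [0,4]} p^i(I', L')$ where
   • $p^0(I', L') = \bigwedge_{i \in [0,k-1)} (b^i \rightarrow b^{i+1})$.
   • $p^1(I', L') = \bigwedge_{i \in [0,k-1)} (b^i \rightarrow (L^{i+1} \simeq F(X^i, L^i)))$.
   • $p^2(I', L') = \bigwedge_{i \in [0,k)} (b^i \rightarrow P(X^i, L^i))$.
   • $p^3(I', L') = \bigwedge_{i \in [1,k)} ((\neg b^{i-1} \wedge b^i) \rightarrow R(L^i))$.
   • $p^4(I', L') = b^{k-1}$.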

Here we extend a given circuit to a witness circuit, which
has $k$ copies of the original latches and inputs, and $k$ additional
latches $B$ that we refer to as the initialisation bits. We refer
to the $(k-1)$-th copy as the most recent, and the 0-th as the oldest.
Intuitively the most recent copy unrolls in the same way as
the original circuit, with the older copies copying the previous
values of the younger copies. When all initialisation bits are
$\top$, we say the machine has reached a “full initialisation” state.
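
The shifting behaviour of the copies and the initialisation bits can be simulated directly. The Python sketch below is only an illustration under simplifying assumptions of ours: the original circuit has a single latch and no inputs, the function name `witness_trace` is hypothetical, and `None` stands for the arbitrary value of a copy that has not been initialised yet.

```python
def witness_trace(k, reset_value, step, n_steps, junk=None):
    """Simulate the latch copies L^0..L^{k-1} and bits b^0..b^{k-1} of a witness circuit.

    Index i of each list holds copy i, so index k-1 is the most recent copy.
    """
    copies = [junk] * (k - 1) + [reset_value]   # only the most recent copy is reset
    bits = [False] * (k - 1) + [True]           # only b^{k-1} starts as true
    trace = [(copies, bits)]
    for _ in range(n_steps):
        # Older copies take the previous value of the next younger copy;
        # the most recent copy follows the original transition function.
        copies = copies[1:] + [step(copies[-1])]
        # Bits shift the same way; b^{k-1} keeps its value.
        bits = bits[1:] + [bits[-1]]
        trace.append((copies, bits))
    return trace

# Hypothetical original circuit: a counter modulo 4 that resets to 0, with k = 3.
trace = witness_trace(3, 0, lambda v: (v + 1) % 4, 3)
for copies, bits in trace:
    print(copies, bits)
```

In this run the full initialisation state is reached after k - 1 = 2 steps, and from then on the three copies hold three consecutive states of the counter.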


Lemma 1. Given a circuit $C$ with reset function $R$ and its
witness circuit $C'$ with reset function $R'$, if $R$ is stratified,
then $R'$ is also stratified.


Theorem 2. Given a circuit $C$ and its witness circuit $C'$, $C'$
simulates $C$.


We now present the main theorem of this paper. 
Fig. 2: Bit-level: the experimental results of the HWMCC 2010. The benchmark names are shown on the x-axis. The
time ratio on the y-axis is calculated by computing certification time divided by model checking time (ran on the model 
checker McAiger [27]). The black dots in the graph are the results obtained from CERTIFAIGER++ and the grey dots are 
from CERTIFAIGER. The straight line and the dashed line are the calculated means for CERTIFAIGER++ and CERTIFAIGER 
respectively. As we can see from the plot, especially for the instances with certification time greater than 500 seconds, the 
new implementation significantly improved the certification performance. 


Theorem 3. Given a circuit $C = (I, L, R, F, P)$ and its
witness circuit $C' = (I', L', R', F', P')$, $P$ is k-inductive in
$C$ iff $P'$ is 1-inductive in $C'$.


IV. IMPLEMENTATION AND EXPERIMENTAL EVALUATION 


We implemented the proposed certification approach into 
two complete toolkits [28]: CERTIFAIGER++ for bit-level, 
and CERTIFAIGER-WL for word-level. We evaluate the per- 
formance of our tools against several benchmark sets from 
previous literature and the model checking competitions. 


A. Bit-level 


Our toolkit CERTIFAIGER++ extends the certification toolkit 
CERTIFAIGER [21]. Note that the AIGER format only allows 
stratified resets by default. All experiments were performed on
a workstation with an Intel® Core™ i9-9900 CPU at 3.60GHz
and 32GB RAM, running Manjaro with Linux
kernel 5.4.72-1.

To determine the speedups of the new implementation 
proposed in this paper, we performed experiments on the same 
sets of the benchmarks used in [21]. The results are reported 
in Table I. There are significant overall gains in the initiation
checks ($\phi_{init}$) as well as the reset checks ($\phi_{reset}$). For the
initiation check, which checks that the invariant holds in all initial
states, the performance improvement is largely due to the
simplification of the reset functions in the new witness circuit
construction.

The results in Fig. 2 demonstrate that CERTIFAIGER++ in 
general is much faster than CERTIFAIGER during the overall 
certification process. Compared to CERTIFAIGER, CERTIFAIGER++
achieved an overall speedup of 2.46 times. We observe
performance gains in most benchmarks, as the previous
performance bottleneck for certain benchmarks was the QBF
solving time for the reset check. For other instances, the
bottleneck is the SAT solving time for the consecution check, 
which is also improved due to a simpler reset construction (as 
part of the inductive invariant). 


B. Word-level 


We further lifted the method to certifying word-level model 
checking by implementing an experimental toolkit called 
CERTIFAIGER-WL. CERTIFAIGER-WL follows the same archi- 
tecture design as CERTIFAIGER++ and uses Boolector [29] as 
the underlying SMT solver. All models and SMT encodings 
are in BTOR2 [29] format, which is the standard word-level 
model checking format used in hardware model checking 
competitions. 


TABLE II: Summary of certification results for word-level benchmarks from the HWMCC20


Benchmarks k  #model #witness ModelCh. Certifi. Consec. Ratio 
paper_v3 256 35 12801 10.25 1.14 0.90 0.11 
VexRiscv-regch0-15-p0 17 2149 43077 10.31 4.04 3.29 0.39 
zipcpu-pfcache-p02 37 1818 105874 13.95 4.40 2.73 0.32 
zipcpu-pfcache-p24 37 1818 105874 14.35 4.49 2.83 0.31 
zipcpu-busdelay-p43 101 950 145466 15.29 6.14 3.86 0.40 
dspfilters_fastfir_second-p42 15 6732 115388 16.11 14.80 12.96 0.92 
zipcpu-pfcache-p01 41 1818 117434 18.33 6.34 4.47 0.35 
dspfilters_fastfir_second-p10 11 6732 84348 24.56 9.76 8.44 0.40 
zipcpu-busdelay-p15 101 950 145466 58.17 8.18 5.89 0.14 
qspiflash_dualflexpress_divfive-p120 97 3100 394412 63.58 22.07 14.58 0.35 
zipcpu-pfcache-p22 93 1818 267714 166.07 23.66 19.06 0.14 
VexRiscv-regch0-20-p0 22 2149 55862 240.50 16.76 15.76 0.07 
dspfilters_fastfir_second-p14 15 6732 115388 354.01 21.27 19.44 0.06 
dspfilters_fastfir_second-p11 21 6732 161948 627.69 46.88 44.30 0.07 
dspfilters_fastfir_second-p45 17 6732 130908 1094.11 30.14 28.06 0.03 
VexRiscv-regch0-30-p1 32 2150 81464 1444.47 83.38 81.95 0.06 
dspfilters_fastfir_second-p43 19 6732 146428 2813.61 58.02 55.69 0.02 


To select the benchmarks presented, we first ran AVR with a timeout of 5000 seconds. We display here the results that are of particular interest, with a
running time of more than ten seconds (there are 7 instances with k = 1 which were certified and solved under 0.2s). Columns report the benchmark names,
the value of k, the size of the model (measured in number of instructions) and of the generated witness, the model checking time, and the certification time (in
seconds). Additionally we list the time Boolector took to solve the consecution check, as well as the ratio of certification time to model checking time. We only
list the consecution check (Consec.) here as it takes up the majority of the certification time.


Fig. 3: Word-level: model checking vs. certification time for the Counter example (with 500 bits) with increasing values
of k. For the experiments, we fixed the modulo bound at 32 and scaled the inductive depth up to 1000. The certification
time is significantly smaller than the model checking time. As the value of k increases, on average the certification time is
proportionally lower.


We ran benchmarks of the Counter example [21] on 
AVR [30] to get the values of k. Fig. 3 shows the experimental 
results obtained with CERTIFAIGER-WL under the same setting 
as Section IV-A. Interestingly, the certification time is much
lower than the model checking time, as can be seen in the
diagram, meaning that certification comes at a low cost.


In Table II we report the experimental results obtained 
on a superset of the hardware model checking competition 
2020 [31] benchmarks. We observe that the certification time is 
much lower than model checking time. Including certification 
would increase the runtime of AVR on the model checking 
benchmarks by less than 6%. 
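
As a sanity check, the overhead can be recomputed from the rows listed in Table II. The Python sketch below assumes that the overhead is the total certification time divided by the total model checking time; the text does not spell out how the figure was obtained, and the table lists only the instances that ran for more than ten seconds, so this is one plausible reading rather than the authors' exact computation.

```python
# (model checking time, certification time) in seconds, copied from Table II.
rows = [
    (10.25, 1.14), (10.31, 4.04), (13.95, 4.40), (14.35, 4.49), (15.29, 6.14),
    (16.11, 14.80), (18.33, 6.34), (24.56, 9.76), (58.17, 8.18), (63.58, 22.07),
    (166.07, 23.66), (240.50, 16.76), (354.01, 21.27), (627.69, 46.88),
    (1094.11, 30.14), (1444.47, 83.38), (2813.61, 58.02),
]

total_model_checking = sum(mc for mc, _ in rows)
total_certification = sum(cert for _, cert in rows)
overhead = total_certification / total_model_checking

print(f"model checking: {total_model_checking:.2f}s, "
      f"certification: {total_certification:.2f}s, "
      f"overhead: {overhead:.1%}")
```

On these rows the aggregate overhead comes out slightly above 5%, which is consistent with the stated bound, even though individual ratios in the table range from 0.02 to 0.92.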
V. CONCLUSION AND FUTURE WORK


We have presented a new certification framework which 
allows certification for k-induction to be done by six SAT 
checks and a polynomial-time check. We further lifted our 
approach to word-level, and implemented our method in both 
contexts. Experimental results demonstrate the effectiveness 
and computational efficiency of our toolkits. The removal of 
the QBF quantifiers has reduced the theoretical complexity of 
the problem compared to [21] and also reduced the overall 
runtime overhead of the certification. Additionally, in future 
work we plan to obtain formally verified certificate checkers 
by using theorem proving. Finally, how to certify liveness 
properties is another important avenue of further research. 
Abstract—Satisfiability modulo theories (SMT) solvers are
widely used to prove security and safety properties of computer 
systems. For these applications, it is crucial that the result 
reported by an SMT solver be correct. Recently, there has been 
a renewed focus on producing independently checkable proofs 
in SMT solvers, partly with the aim of addressing this risk. 
These proofs record the reasoning done by an SMT solver and 
are ideally detailed enough to be easy to check. At the same 
time, modern SMT solvers typically implement hundreds of 
different term-rewriting rules in order to achieve state-of-the-art 
performance. Generating detailed proofs for applications of these 
rules is a challenge, because code implementing rewrite rules can 
be large and complex. Instrumenting this code to additionally 
produce proofs makes it even more complex and makes it harder 
to add new rewrite rules. We propose an alternative approach to 
the direct instrumentation of the rewriting module of an SMT 
solver. The approach uses a domain-specific language (DSL) to 
describe a set of rewrite rules declaratively and then reconstructs 
detailed proofs for specific rewrite steps on demand based on 
those declarative descriptions. 


I. INTRODUCTION 


Satisfiability modulo theories (SMT) solvers are widely 
used to reason about the security and safety of critical sys- 
tems [1, 2, 10, 13]. These applications require a high level 
of trust in the correctness of the underlying solver. SMT 
solvers, however, are complex pieces of software, in some 
cases consisting of hundreds of thousands of lines of code. 
As with any other large and complex software project, they 
are not immune to bugs [17], which may, in the worst case, 
cause incorrect results. Due to the size and complexity of 
SMT solvers and the fact that most of them continue to be 
in active development, their full verification is currently still 
out of reach. As a consequence, the best one can do is to 
check their individual answers based on evidence provided by 
the solvers themselves. 

For quantifier-free inputs reported to be satisfiable, SMT 
solvers are typically capable of producing as evidence a 
satisfying model, which can then be used to validate the claim. 
Note that for quantified formulas, model validation for satis- 
fiable queries is usually still possible although more complex. 
For unsatisfiable inputs, there have been efforts in recent years 
towards producing independently checkable proofs, which
record the reasoning steps required to deduce unsatisfiability. 
These steps can later be replayed and checked efficiently by 
a proof checker. Proofs can also be used to automatically 
discharge proof obligations in interactive theorem provers such 
as Coq [25] and Isabelle [19]. For this use case, the SMT 
solver acts as an automated tactic. The proof obligation is 
encoded as an SMT problem and the proof generated by the 
SMT solver is then used, in essence, to reconstruct a proof in 
the proof assistant’s native proof representation. 

Producing and checking proofs for unsatisfiable problems 
requires considerably more effort than generating and validat- 
ing models for satisfiable inputs. Additionally, proofs can be 
produced in many different forms, each with its own trade-offs. 
When it comes to the form of a proof, one characteristic of 
interest is the proof’s granularity. Fine-grained proofs enable 
efficient proof checking since the proofs are detailed enough 
to not require any search during checking. Similarly, proof 
reconstruction for interactive theorem provers requires detailed 
proofs to minimize holes that must be proved manually. How- 
ever, fine-grained proofs are generally more costly to produce. 
Coarse-grained proofs, on the other hand, are cheaper to 
produce but require more computation to check. Regardless of 
the proof form, the traditional approach for generating proofs 
is to instrument each component of the SMT solver to record 
its reasoning steps, and then consolidate the relevant recorded 
steps into a single proof. 

Instrumentation can be particularly challenging and tedious 
for the components of the solver that implement rewriting. 
Modern SMT solvers implement hundreds of rewrite rules 
for normalizing and simplifying terms to achieve state-of-the- 
art performance. Because rewriting is an essential part of the 
reasoning done by the solver, a proof must contain a record of 
the rewriting steps performed. Previous work [6] has described 
how to generate rewriting proofs whose only holes are atomic 
rewrites, i.e., an application of a single rewrite step to a single 
term. Such proofs use a single generic rule for all atomic 
rewrites. This approach has two major drawbacks, however: 
(i) the proof checker has to guess or search for the rule to 
apply or trust that the rewriting was done correctly; and (ii) if 
used in a proof assistant, each rewrite step becomes a proof 
obligation that must be discharged by the user. On the other 
hand, if occurrences of atomic rewrites are proven using a
fixed set of specific rules, we can prove the correctness of the 
rules in this set once and for all and then use those proofs 
during proof checking or during replay in a proof assistant. 


As mentioned above, instrumenting rewriting code for proof 
generation is difficult and tedious. Additionally, since rewriting 
is applied not only as a preprocessing step but also repeatedly 
during the solving process, rewriting code (including any 
instrumentation) must be efficient. In this work, we propose an 
alternative approach that does not rely on instrumenting the 
original rewriter. Instead, our approach treats the rewriter as a 
black box and relies on a post-processing phase to expand 
coarse-grained rewriting steps occurring in proofs into fine-
grained proofs. We use a generic reconstruction algorithm that 
consults a separate database of core rewrite rules in order 
to produce the detailed proof using as input only the terms 
before and after an atomic rewrite. The core rewrite rules 
need not include every atomic rewrite. It is enough for every 
atomic rewrite to be reconstructable using one or more of the 
core rewrite rules. This simplifies the task of populating the 
database, as the rules used can be fewer and simpler than 
what is actually done in the solver. To specify the set of rules 
in the database, we propose the use of a high-level, domain- 
specific language (DSL) designed to succinctly express a set 
of core rules to be used in proofs. We have used this approach 
to reconstruct detailed proofs for the theory of strings in the 
SMT solver cvc5 [4]. In our experience, this approach greatly
reduces the burden of proof production for rewriting code, 
as it allows a solver developer to quickly and incrementally 
define core rewrite rules to help fill holes in proofs. Also, note 
that rewrite steps are typically equality-preserving. Because 
we treat the rewriter as a black box (i.e., independently from 
any specific solver or implementation), our approach is quite 
general and could be used to produce or complete proofs for 
any tool or situation where proofs of equivalence are needed. 
By providing a DSL for specifying rewrites and an automatic 
reconstruction algorithm for coarse-grained atomic rewrites, 
we expect to greatly improve the flexibility and usability of 
proofs from SMT solvers. Our contributions are as follows: 


• We propose an SMT-LIB-like domain specific language
  for defining rewrite rules.

• We describe an algorithm that can use such rules to
  reconstruct detailed proofs for rewrites in an SMT solver.

• We implement our approach in cvc5 and report on a
  case study reconstructing detailed proofs for rewrites in
  the theory of strings.

• We evaluate our implementation and show that it has
  reasonable performance in practice.
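
The idea of reconstructing a detailed proof from only the terms before and after an atomic rewrite can be illustrated with a naive search. The Python sketch below is not the reconstruction algorithm of Section IV: it is a bounded breadth-first search of our own, with a hypothetical two-rule database, terms encoded as nested tuples, and pattern variables written as strings starting with `?`.

```python
from collections import deque

def match(pattern, term, subst):
    """Extend subst so that pattern instantiates to term, or return None."""
    if isinstance(pattern, str) and pattern.startswith("?"):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if isinstance(pattern, tuple) and isinstance(term, tuple) and len(pattern) == len(term):
        for p, t in zip(pattern, term):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None

def instantiate(pattern, subst):
    if isinstance(pattern, tuple):
        return tuple(instantiate(p, subst) for p in pattern)
    return subst.get(pattern, pattern)

def successors(term, rules):
    """Yield (rule name, new term) for each single rule application at any position."""
    for name, lhs, rhs in rules:
        subst = match(lhs, term, {})
        if subst is not None:
            yield name, instantiate(rhs, subst)
    if isinstance(term, tuple):
        for i in range(1, len(term)):
            for name, new_arg in successors(term[i], rules):
                yield name, term[:i] + (new_arg,) + term[i + 1:]

def reconstruct(before, after, rules, max_steps=4):
    """Return a list of (rule name, resulting term) leading from before to after, or None."""
    queue, seen = deque([(before, [])]), {before}
    while queue:
        term, steps = queue.popleft()
        if term == after:
            return steps
        if len(steps) == max_steps:
            continue
        for name, nxt in successors(term, rules):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [(name, nxt)]))
    return None

# Hypothetical core rules: x + 0 = x and x * 1 = x.
RULES = [("plus-zero", ("+", "?x", 0), "?x"), ("times-one", ("*", "?x", 1), "?x")]
proof = reconstruct(("*", ("+", "a", 0), 1), "a", RULES)
print(proof)
```

Here a single coarse-grained step from `(a + 0) * 1` to `a` is expanded into two applications of core rules, and a rewrite that the database cannot justify yields `None`, i.e., a remaining hole in the proof.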


In the remainder of the paper, we provide an overview of our
approach (Section II) and then describe the language (Sec-
tion III) and the proof reconstruction algorithm (Section IV)
in more detail. We then present a case study of using the
approach to produce detailed proofs for the theory of strings
in cvc5 (Section V) and evaluate our approach experimentally
(Section VI). Finally, we conclude with some future directions
for the language and our approach (Section VII).


A. Related Work 


Barbosa et al. [5] introduced a framework for modular- 
izing the production of proofs for formula processing and 
term rewriting, a long-standing challenge for SMT solvers. 
A similar and more general framework for overall proof 
production [6] was recently implemented in cvc5. However,
both frameworks produce proofs that are coarse-grained with 
respect to atomic rewrites, i.e., each atomic rewriting step is 
a single proof step without further justification. 

In the integration between the veriT solver [11] and the 
Isabelle/HOL proof assistant [23], which leverages the frame- 
work from [5], the Sledgehammer tool [8] sends proof goals 
to veriT and then reconstructs proofs from those emitted by 
veriT in the Alethe proof format [22]. The reconstructed proofs 
can then be used to prove the original Isabelle/HOL proof 
goals. An initial version of this framework was similarly 
coarse-grained: every atomic rewrite applied by the solver was 
justified with a single Alethe proof rule. As shown by Schurr 
et al. [23], this led to failures and performance issues in the 
Isabelle/HOL reconstruction of Alethe proofs. One approach 
to address this issue is to extend the Alethe format to contain 
finer-grained rules for atomic rewrites, and to integrate each of 
these rules into both veriT and Sledgehammer. This has been 
shown to increase the success rate of proof reconstruction, but 
the process is fully manual: every new rule added requires 
updating the solver, the format, and the reconstruction. 

Nötzli [20] proposed a language for rewrite rules in SMT
solvers with the goal of automatically generating executable 
code that replaces parts of an existing rewriter. The DSL 
presented in this work is an evolution of that language and 
is focused on the needs of proof reconstruction. Our ded- 
icated rewrite language bears some similarity to equational 
specification languages such as Maude [12], ELAN [9], and 
CafeOBJ [14]. In contrast to those more general-purpose 
languages, the DSL presented in this work has a much more 
narrow scope and includes specific features to support its use 
in proof reconstruction. 


B. Formal Preliminaries 


We formalize our work within the setting of many-sorted
logic with equality (see e.g., [15, 26]). Let $S$ be a set of sort
symbols. For every sort $\tau \in S$, we assume an infinite set of
variables of that sort. A signature $\Sigma$ consists of a set $\Sigma^s \subseteq S$
of sort symbols and a set $\Sigma^f$ of function symbols. Constants
are treated as 0-ary functions. We assume that $\Sigma$ includes a
sort Bool, interpreted as the Boolean domain, and the Bool
constants $\top$ (true) and $\bot$ (false). Signatures do not contain
separate predicate symbols and use instead function symbols
that return a Bool value. We further assume that for all sorts
$\tau \in S$, $\Sigma$ contains an equality symbol $\approx\, : \tau \times \tau \to$ Bool,
interpreted as the identity relation. Finally, we assume the
usual definitions of well-sorted terms, literals, and formulas.

A $\Sigma$-interpretation $\mathcal{I}$ maps: each $\tau \in \Sigma^s$ to a distinct
non-empty set of values $\tau^{\mathcal{I}}$ (the domain of $\tau$ in $\mathcal{I}$); each variable
$x$ of sort $\tau$ to an element $x^{\mathcal{I}} \in \tau^{\mathcal{I}}$; and each $f^{\tau_1 \cdots \tau_n \tau} \in \Sigma^f$
to a total function $f^{\mathcal{I}} : \tau_1^{\mathcal{I}} \times \cdots \times \tau_n^{\mathcal{I}} \to \tau^{\mathcal{I}}$ if $n > 0$,
and to an element in $\tau^{\mathcal{I}}$ if $n = 0$. We use the usual
notion of a satisfiability relation $\models$ between $\Sigma$-interpretations
and $\Sigma$-formulas. A $\Sigma$-theory $T$ is a non-empty class of $\Sigma$-
interpretations closed under variable reassignment (i.e., every
interpretation that only disagrees with an interpretation in $T$
on how it interprets variables is also in $T$). A $\Sigma$-formula $\varphi$ is
T-satisfiable (resp., T-unsatisfiable, T-valid) if it is satisfied
by some (resp., no, all) interpretations in $T$. We write $\models_T \varphi$
when $\varphi$ is T-valid. We say that $\varphi_1$ T-entails $\varphi_2$, and write
$\varphi_1 \models_T \varphi_2$, when $\models_T \varphi_1 \Rightarrow \varphi_2$.

Fig. 1: Overview of the components of our approach


II. OVERVIEW 


In this paper, we assume a fixed theory $T$ and consider only
rewrite rules that preserve equivalence in $T$. Formally, let $t{\downarrow_a}$
denote the result of performing atomic rewrite $a$ on term $t$.
Then, we require that $\models_T t \approx t{\downarrow_a}$.

Figure 1 shows an overview of our proposed approach. 
Modern SMT solvers implement a large number of theory- 
specific rewrite rules. Conceptually, the implementation of 
these theory-specific rewrite rules can be seen as theory 
rewriter modules of the individual theory solvers. A rewriter 
is a module that traverses a given term and invokes the 
appropriate theory rewriter on each subterm. To determine 
which theory rewriter to call, the rewriter looks at the top-most 
symbol of the subterm and calls the theory whose signature 
contains that symbol. The proof module, which manages 
proofs, utilizes the rewrite proof reconstructor to fill in the 
missing subproofs for rewrites. The rewrite proof reconstructor 
bases its reconstruction on a set of rewrite rules, stored in the 
rewrite rule database. This database is generated at compile- 
time from a set of rewrite rules written in our DSL RARE 
(described in Section III). These rewrite rules are stored in text
files, which are compiled to C++ code using the DSL compiler. 
The compiled code populates a discrimination tree [16] which 
is an index used for matching terms with applicable rewrite 
rules during proof reconstruction. Assuming that the rewrite 
rules in the rewrite rule database are correct, our reconstruction 
is sound since only these rules are used to construct the proofs. 
    ⟨rule⟩    ::= ( define-rule ⟨symbol⟩ ( ⟨par⟩* ) ⟨expr⟩ ⟨expr⟩ )
                | ( define-cond-rule ⟨symbol⟩ ( ⟨par⟩* ) ⟨expr⟩ ⟨expr⟩ ⟨expr⟩ )
                | ( define-rule* ⟨symbol⟩ ( ⟨par⟩* ) ⟨expr⟩ ⟨expr⟩ [⟨expr⟩] )
    ⟨par⟩     ::= ⟨symbol⟩ ⟨sort⟩ ⟨attr⟩*
    ⟨sort⟩    ::= ? | ⟨symbol⟩ | ( ⟨symbol⟩ ⟨sort⟩* ) | ( _ ⟨symbol⟩ ⟨idx⟩+ )
    ⟨idx⟩     ::= ? | ⟨numeral⟩
    ⟨expr⟩    ::= ⟨const⟩ | ⟨id⟩ | ( ⟨id⟩ ⟨expr⟩* ) | ⟨let⟩
    ⟨id⟩      ::= ⟨symbol⟩ | ( _ ⟨symbol⟩ ⟨idx⟩* )
    ⟨let⟩     ::= ( let ( ⟨binding⟩+ ) ⟨expr⟩ )
    ⟨binding⟩ ::= ( ⟨symbol⟩ ⟨expr⟩ )

Fig. 2: Overview of the grammar of RARE.


The output of the proof module consists of the proof with the 
subproofs for rewrites completed. 
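
The dispatch performed by the rewriter can be sketched as follows. This Python fragment is a simplified illustration and not cvc5 code: the two theory rewriters, their signatures, and the tuple encoding of terms are hypothetical, and the log of (before, after) pairs stands in for the coarse-grained atomic rewrite steps that the rewrite proof reconstructor later has to justify.

```python
def arith_rewriter(term):
    """Hypothetical arithmetic theory rewriter: x + 0 -> x and x * 1 -> x."""
    if len(term) == 3 and term[2] == {"+": 0, "*": 1}[term[0]]:
        return term[1]
    return term

def bool_rewriter(term):
    """Hypothetical Boolean theory rewriter: double negation elimination."""
    if term[0] == "not" and isinstance(term[1], tuple) and term[1][0] == "not":
        return term[1][1]
    return term

# Each theory is given by the set of symbols in its signature and its rewriter.
THEORIES = [({"+", "*"}, arith_rewriter), ({"not", "and", "or"}, bool_rewriter)]

def rewrite(term, theories, log):
    """Rewrite subterms first, then call the theory rewriter owning the top-most symbol."""
    if not isinstance(term, tuple):
        return term                      # variables and constants are left as they are
    op, *args = term
    term = (op, *(rewrite(arg, theories, log) for arg in args))
    for signature, theory_rewriter in theories:
        if op in signature:
            result = theory_rewriter(term)
            if result != term:
                log.append((term, result))           # one atomic rewrite step
                return rewrite(result, theories, log)
            break
    return term

log = []
result = rewrite(("not", ("not", ("=", ("+", "x", 0), "y"))), THEORIES, log)
print(result)
for before, after in log:
    print(before, "->", after)
```

Only the terms before and after each step are recorded, which mirrors the black-box view of the rewriter taken in this paper.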

The rule database may also play a role in proof checking. In 
particular, a stand-alone proof checker may use the database 
to automatically generate code that can check whether a rule 
in the database is used correctly. While the syntax is checked 
in this scheme, the T-validity of the rules in the database 
is trusted. Checking the rules for T-validity is another task 
which can (and should) be done separately, perhaps using a 
proof assistant. We do not address these issues in this paper, 
but instead focus on the RARE language and the algorithm at 
the core of the rewrite proof reconstructor. 


III. THE LANGUAGE


In this section, we describe the scope, design goals, syntax, 
and semantics of RARE, our domain-specific language for 
rewrites, automatically reconstructed. To reduce the cost of in- 
troducing such a new language into the development workflow 
of an existing SMT solver, we identify several requirements: 


Succinctness: Writing rewrite rules should be simple and 
concise. Adding new rules should be far less costly than 
instrumenting existing code. 

Expressiveness: The language should be able to express the 
majority of the rewrite rules used in a state-of-the-art 
SMT solver. 

Accessibility: The language should be easy to parse and 
understand. 


There is an inherent tension between making a DSL succinct 
and making it expressive. We designed RARE to be as expres- 
sive as possible without sacrificing succinctness. To aid with 
accessibility, its syntax reuses the syntax of the SMT-LIB [7] 
language standard whenever possible. 


As we discuss in Section V, we do not aim for full 
generality, because certain rewrites, such as polynomial nor- 
malization, are less amenable to our approach. Similarly, we 
assume that constant folding is built into the reconstruction 
algorithm and therefore does not have to be explicitly defined 
with rewrite rules. 

An input file for RARE consists of a list of rewrite rules 
whose syntax is defined by the BNF grammar in Figure 2. 
Rewrite rules are written as S-expressions. For symbols and 
concrete constants (e.g., integer numbers, string literals), 
RARE uses the same syntax as the SMT-LIB language. In 
contrast to SMT-LIB, parameterized sorts such as arrays and 
bit-vectors do not need to be concrete. Instead, RARE is 
gradually typed and allows the parameters of such sorts to 
remain abstract. This allows users to specify rewrites that are, 
e.g., independent of the bit-width or the sorts of indices and 
elements in arrays. In the following, we discuss all the different 
constructs of the language in detail. 


Basic Syntax. As indicated in Figure 2, ⟨rule⟩ defines three 
different types of rewrite rules: basic rules (define-rule), 
conditional rules (define-cond-rule), and fixed-point rules 
(define-rule*). A basic rule consists of a name, a list 
of match parameters, the match expression, and the target 
expression. The name identifies the rewrite rule and is later 
used to label steps in the rewrite proof; the list of parameters 
⟨par⟩* introduces the term variables that appear in the 
rule, along with their sorts; the match expression defines the 
syntactic shape of terms the rewrite rule applies to; and the 
target expression defines how a matched term is rewritten. 
Both the match expression and the target expression have the 
same syntax as SMT-LIB terms. All the variables that appear 
in a rewrite rule must either be declared as a parameter or 
introduced locally with the let binder. 

Basic rules define simple rewrite rules without preconditions. 
The following example shows such a rule, which defines 
the rewrite substr("", m, n) ⇝ "" from a term denoting the 
substring of length n starting at position m of the empty string 
to just the empty string, regardless of the value of m and n. 


(define-rule substr-empty 
  ((m Int) (n Int)) 
  (str.substr "" m n) "") 


In this example, the match expression specifies that the rule 
applies to string terms of the form substr("", s, t) where 
the first argument of substr is the empty string, the second 
argument s is matched by m, and the third argument t is 
matched by n. The compiler and the proof reconstruction 
algorithm have built-in knowledge of theory symbols such as 
substr as defined in the SMT-LIB standard. 


Matching. If a variable x appears multiple times in a match 
expression, the rewrite rule only applies if each occurrence 
of x matches syntactically identical terms. For example, the 
match expression (= (str.++ x1 x2) x2) with variables 
x1 and x2 matches a ++ b ≈ b, but not a ++ b ≈ c. For a 
rewrite rule to apply, a term matched by a declared variable 
must be of the expected sort. We use ? to denote that a term 
can be of any sort, or to match an arbitrary sort parameter. 
The following example illustrates the use of multiple variable 
occurrences and abstract sorts. 


(define-rule eq-refl ((t ?)) (= t t) true) 


This rule rewrites equalities of syntactically identical terms 
to ⊤, regardless of the sort of the term matched by variable t. 
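
The matching behavior described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the matcher used by the RARE compiler: terms are nested tuples, the function name match and the params argument are hypothetical, and sorts are ignored entirely. 

```python
# Terms are nested tuples such as ("=", "a", "a"); plain strings are symbols.
# `params` is a hypothetical stand-in for a rule's parameter list; sorts are
# not modeled in this sketch.

def match(pattern, term, params, subst=None):
    """Return a substitution that makes `pattern` syntactically identical to
    `term`, or None if there is none."""
    subst = dict(subst or {})
    if isinstance(pattern, str) and pattern in params:
        if pattern in subst:
            # Repeated variable: every occurrence must bind the same term.
            return subst if subst[pattern] == term else None
        subst[pattern] = term
        return subst
    if isinstance(pattern, tuple):
        if not isinstance(term, tuple) or len(pattern) != len(term):
            return None
        for p, t in zip(pattern, term):
            subst = match(p, t, params, subst)
            if subst is None:
                return None
        return subst
    return subst if pattern == term else None


# (= (str.++ x1 x2) x2) matches a ++ b = b, but not a ++ b = c.
pat = ("=", ("str.++", "x1", "x2"), "x2")
ok = match(pat, ("=", ("str.++", "a", "b"), "b"), {"x1", "x2"})
bad = match(pat, ("=", ("str.++", "a", "b"), "c"), {"x1", "x2"})
```

Here ok binds x1 to a and x2 to b, while bad is None because the second occurrence of x2 would have to bind a different term. 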


Lists. Some operators defined in SMT-LIB, e.g., string 
concatenation, can be applied to two or more terms. We 
use variables declared with the :list attribute to match an 
arbitrary number of arguments of an operator. The following 
example shows a rule for flattening string concatenations. 


(define-rule str-concat-flatten ( 
    (xs String :list) (s String) 
    (ys String :list) (zs String :list) ) 
  (str.++ xs (str.++ s ys) zs) ; match 
  (str.++ xs s ys zs)) ; target 


This rule applies to any string concatenation with another 
string concatenation as a subterm. The prefix xs and the suffix 
zs may be empty (although not at the same time in this case). 
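
To illustrate what the list variables buy, the following Python sketch applies a flattening step in the spirit of str-concat-flatten to terms written as nested tuples. The function names flatten_once and flatten are hypothetical, and the term encoding is an assumption of this sketch; it is not how the rule is compiled or applied in the solver. 

```python
def flatten_once(term):
    """Splice the first nested str.++ argument into its parent, mirroring
    (str.++ xs (str.++ s ys) zs) -> (str.++ xs s ys zs).
    Returns the term unchanged if the rule does not apply."""
    if not (isinstance(term, tuple) and term and term[0] == "str.++"):
        return term
    args = term[1:]
    if len(args) < 2:
        # xs and zs cannot both be empty: the outer concatenation
        # needs at least one argument besides the nested one.
        return term
    for i, arg in enumerate(args):
        if isinstance(arg, tuple) and arg and arg[0] == "str.++":
            # args[:i] plays the role of xs, arg[1:] of (s ys), args[i+1:] of zs.
            return ("str.++",) + args[:i] + arg[1:] + args[i + 1:]
    return term


def flatten(term):
    """Apply flatten_once until the term no longer changes."""
    nxt = flatten_once(term)
    while nxt != term:
        term, nxt = nxt, flatten_once(nxt)
    return term


nested = ("str.++", "a", ("str.++", "b", ("str.++", "c", "d")), "e")
flat = flatten(nested)
```

Because xs, ys, and zs can each absorb any number of arguments, a single rule covers nesting at every argument position. 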


Conditional Rules. The previous rewrite rule examples rely 
on purely syntactic matching. To make matching more expres- 
sive, RARE supports the conditional matching of terms using 
define-cond-rule. Such rules have an additional argument, 
the precondition, before the match expression. That is either 
a single condition, expressed by a literal, or a conjunction of 
them capturing all conditions that must be met for the rule to 
apply. When reconstructing a proof, these conditions introduce 
new proof obligations. The following example illustrates the 
use of conditional rules. 


(define-cond-rule concat-clash ( 
    (s1 String) (s2 String :list) 
    (t1 String) (t2 String :list) ) 
  (and (= (str.len s1) (str.len t1)) 
       (not (= s1 t1))) 
  (= (str.++ s1 s2) (str.++ t1 t2)) 
  false) 


This rule rewrites a word equation s1 ++ s2 ≈ t1 ++ t2 
to ⊥, provided that two conditions are met: the lengths 
of the prefixes s1 and t1 are the same and the prefixes 
are distinct in the theory T. For example, this rule applies 
to the equality "abc" ++ x ≈ "def" ++ y since both 
|"abc"| ≈ |"def"| ≈ 3 and "abc" ≉ "def" hold in 
the theory of strings. Note that the precondition |s1| ≈ |t1| 
does not require the evaluation of |s1| and |t1|. Instead, it just 
requires some proof that they are equal. In practice, we prove 
the precondition by applying additional rewrite rules. This 
allows us to show that the precondition holds for equalities 
such as |x ++ y| ≈ |y ++ x|, for instance. 
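
As a sanity check of the intuition behind concat-clash, the following Python sketch enumerates short concrete strings and confirms that no word equation holds when the prefixes have equal length but differ. This is a bounded test on concrete values only, written for illustration; it says nothing about how T-validity of rules is (or should be) established, and the helper names are hypothetical. 

```python
from itertools import product


def words(alphabet, max_len):
    """Yield all strings over `alphabet` of length at most `max_len`."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def concat_clash_holds(alphabet="ab", max_len=2):
    """Check on small concrete strings that |s1| = |t1| and s1 != t1
    imply s1 ++ s2 != t1 ++ t2."""
    ws = list(words(alphabet, max_len))
    for s1, s2, t1, t2 in product(ws, repeat=4):
        precondition = len(s1) == len(t1) and s1 != t1
        if precondition and s1 + s2 == t1 + t2:
            return False  # a counterexample to the rule
    return True


checked = concat_clash_holds()
```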


rc(t ≈ s, d) 
 1: if d < 0 then return ⊥ 
 2: if t ≈ s ∈ P then 
 3:    if P[t ≈ s] = (fail, e) and d ≤ e then return ⊥ 
 4:    if P[t ≈ s] ≠ (fail, e) then return ⊤ 
 5: if t↓e = s↓e then P[t ≈ s] := eval, return ⊤ 
 6: if (t ≈ s)↓ = ⊥ then P[t ≈ s] := (fail, ∞), return ⊥ 
 7: P[t ≈ s] := (fail, d) 
 8: if (t, s) = (f(ū), f(v̄)) and 
 9:    rc(u ≈ v, d) for all u ≈ v ∈ ū ≈ v̄ then 
10:    P[t ≈ s] := cong, return ⊤ 
11: if t = f(ū) and ū↓ = c̄ and f(c̄)↓e = s↓e and 
12:    rc(u ≈ c, d) for all u ≈ c ∈ ū ≈ c̄ then 
13:    P[t ≈ s] := ceval, return ⊤ 
14: foreach (r, p̄ ≈ q̄ ⇒ u ≈ v) ∈ R 
15:    s.t. t = σ(u) for some σ do 
16:    if rc(σ(v) ≈ s, d − 1) and 
17:       rc(σ(p ≈ q), d − 1) for all p ≈ q ∈ p̄ ≈ q̄ then 
18:       P[t ≈ s] := r, return ⊤ 
19: return ⊥ 


Fig. 3: The algorithm for reconstructing a proof sketch P from 
rule database R. Calling rc(t ≈ s, d) returns true if the proof 
of t ≈ s having depth at most d can be constructed. 


Fixed-Point Rules. As an optimization, RARE allows the 
definition of fixed-point rules with define-rule*. These 
rules are repeatedly applied until they no longer apply. They 
are most useful for rewrite rules that effectively iterate over 
arguments of n-ary operators, as we demonstrate in the example 
below. Fixed-point rules take a match expression, a target 
expression, and, optionally, a context expression as arguments. 
The target expression indicates the recursion step, i.e., the term 
that should be rewritten next. The context expression indicates 
how to use the result of the recursion step to construct the final 
result. It is a term with a placeholder _ for the location of the 
result of the recursion step. Omitting the context expression 
is the same as providing a context of _, which indicates that 
the result of the recursion step is also the final result. The 
following example defines a rewrite rule that distributes the 
string length operator over the elements in a concatenation: 


(define-rule* str-len-concat-rec ( 
    (s1 String) (s2 String) 
    (rest String :list) ) 
  (str.len (str.++ s1 s2 rest)) 
  (str.len (str.++ s2 rest)) 
  (+ (str.len s1) _)) 


This rule specifies that we rewrite |s1 ++ s2 ++ ...| to |s1| + t, 
where t is the result of recursively applying the rule to the 
term |s2 ++ ...|. 

Annotating rules to be fixed-point rules reduces the search 
space during reconstruction, because the reconstruction algo- 
rithm always applies these rules until a fixed-point is reached, 
without considering possible interleavings of other rules. This 
improves efficiency at the cost of not considering some possi- 
ble reconstructions. Thus, there is a trade-off, and this feature 
must be used carefully. 
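
The interplay between recursion step and context can be sketched as follows in Python for the str-len-concat-rec rule. This is an illustrative model only: terms are nested tuples, the function name is hypothetical, and collapsing a one-argument concatenation to its sole argument is an assumption of the sketch rather than documented RARE behavior. 

```python
def str_len_concat_rec(term):
    """Rewrite (str.len (str.++ s1 s2 rest...)) by applying the recursion
    step until it no longer matches, then rebuild the result through the
    contexts (+ (str.len s1) _), innermost first."""
    heads = []  # the s1 of each application, outermost first
    while (isinstance(term, tuple) and term[0] == "str.len"
           and isinstance(term[1], tuple) and term[1][0] == "str.++"
           and len(term[1]) >= 3):
        _, s1, *tail = term[1]
        heads.append(s1)
        # Recursion step: (str.len (str.++ s2 rest...)).
        # Assumption: a unary concatenation collapses to its argument.
        inner = tail[0] if len(tail) == 1 else ("str.++", *tail)
        term = ("str.len", inner)
    # Plug each intermediate result into the placeholder _ of its context.
    for s1 in reversed(heads):
        term = ("+", ("str.len", s1), term)
    return term


result = str_len_concat_rec(("str.len", ("str.++", "a", "b", "c")))
```

For the three-argument concatenation above, result is the nested sum (+ (str.len a) (+ (str.len b) (str.len c))). 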


IV. RECONSTRUCTING PROOFS 


In this section, we describe our approach for constructing 
proofs of rewrites t ⇝ t↓ using rules from a rewrite rule 
database R obtained by compiling RARE rules. To simplify 
the presentation, we do not consider fixed-point rules for 
now, postponing the general case to later in this section. The 
database R stores a set of labeled implications of the form 
(r, p̄ ≈ q̄ ⇒ t ≈ s), where r is a rule identifier, p̄ ≈ q̄ 
is a conjunction of term equalities, and p̄ ≈ q̄ ⊨T t ≈ s. 
Operationally, the rule specifies that a term t can be rewritten 
to a term s when the premises p̄ ≈ q̄ hold. Note that using just 
equalities in the premises is without loss of generality since 
an arbitrary formula φ can be expressed as a premise of the 
form φ ≈ ⊤. Unconditional rules are represented using the 
single, valid premise ⊤ ≈ ⊤. 


eval:   (no premises)        ⊢  t ≈ t↓e 
trans:  r ≈ s,  s ≈ t        ⊢  r ≈ t 
cong:   s̄ ≈ t̄                ⊢  f(s̄) ≈ f(t̄) 
ceval:  s̄ ≈ t̄                ⊢  f(s̄) ≈ (f(t̄))↓e 

Fig. 4: The basic proof rules; t↓e is the evaluated form of t. 

Our proof reconstruction for an equality t ≈ t↓, based on 
the rule database R, consists of two phases. In the first one, 
captured by the algorithm in Figure 3, we search for a proof 
sketch P, which is a map from term equalities to rules that 
can be used to prove them in a final proof. In the second, 
the discovered proof sketch, if any, is transformed into a full 
proof, which may consist of the application of multiple rules 
from R, as described later in this section. 


A. Finding Proof Sketches 


Figure 3 shows our algorithm rc for recursively finding 
proof sketches for equalities t ≈ s. The inputs are the 
(oriented) equality to prove and an integer d specifying an upper 
bound on the depth of rc’s recursive calls. Some recursive 
calls are generated by the algorithm’s attempt to justify the 
use of a conditional rule from R to prove the input equality. 
In that case, the algorithm attempts to prove the premises 
of the conditional rule, but does so for a decreased depth. 
The rationale behind the depth limit on the search is that 
there is no guarantee that preconditions are simpler than the 
current equality to be proved, and so there is no guarantee 
of termination in general. The depth limit can be chosen by 
the user at runtime to maximize the chances of successfully 
reconstructing a proof for a rewrite while minimizing the 
amount of work spent on unsuccessful parts of the search 
space. Note that d is decremented only in recursive calls over 
the premises of conditional rules. For other recursive calls, 
which are over subterms of the input equality, termination is 
ensured by the reduction in the size of the new input equality. 

The algorithm returns ⊤ if it finds a proof sketch for t ≈ s 
within the given depth restriction d. During its search, it 
updates a (global) proof sketch map P from term equalities to 
rules r that can be used in the final proof, or to pairs (fail, e) 
indicating that no proof for that equality can be found within 
depth e. We use the array-like notation P[t ≈ s] to refer to 
the value that P associates with t ≈ s. A few of the rewrite 
rules stored in P are built-in, the rest are from the database 
R. The built-in rules are provided in Figure 4 in the style of 
inference rules. Note that trans is actually not used for proof 
sketches, but only for the construction of final proofs. 

Going through the algorithm line by line, we see that it first 
returns ⊥ if the given depth d is negative. Then, on line 2, it 
checks if a proof sketch for t ≈ s has already been determined. 
If so and the value was (fail, e), then no proof was found for 
t ≈ s using depth e. If e is at least d, then it is impossible to 
construct a proof with depth d, and ⊥ is returned to indicate 
failure. On the other hand, if a proof already exists, then ⊤ is 
returned, indicating success. 

If none of these quick-return cases hold, the algorithm 
tries to prove the equality using several techniques, which we 
informally call proof tactics. First, the algorithm checks if the 
equality can be quickly (dis)proven. Specifically, on line 5 the 
simplest tactic checks whether the equality can be proven by 
evaluation, and returns ⊤ if so. We write t↓e to denote the 
evaluated form of t, typically a concrete constant c equivalent 
to t, if one can be determined by recursively evaluating (i.e., 
constant-folding) subterms of t, or t itself otherwise. If the 
evaluated forms of t and s are the same, the algorithm stores 
in P the information that t ≈ s can be proven by evaluation, 
denoted by built-in rule eval. This case applies for instance 
to simple equalities such as 1 + 3 ≈ 2 + 2. On line 6, the 
global rewriter of an SMT solver (denoted as ↓) is used as an 
oracle to check whether the current equality can be rewritten 
to ⊥, which means that the search for a proof sketch is futile. 
In this case, failure is stored as (fail, ∞), indicating that a 
proof for t ≈ s cannot exist because ⊨T t ≉ s. This is a 
fast albeit incomplete check which is useful when the input 
t ≈ s is a precondition of some other rule. If that check fails, 
the search continues because the global rewriter is incomplete, 
and thus a proof for t ≈ s may still exist. On line 7, t ≈ s 
is tentatively marked in P as (fail, d), but then an attempt is 
made to prove t ≈ s using the remaining proof tactics. The 
equality is marked as a failure before running these tactics to 
avoid infinite recursion when t ≈ s happens to be a premise 
in some recursive call. 

Line 9 gives our tactic for proving the given equality by 
congruence, which we associate with a proof rule cong. If 
t and s have the same top symbol f and our reconstruction 
algorithm succeeds in proving equalities pairwise for each of 
their arguments ū ≈ v̄, we mark t ≈ s as proven and return 
⊤. Line 12 gives our tactic for congruence plus evaluation, 
which we associate with a proof rule ceval. This tactic uses 
the global rewriter again as an oracle to check whether all the 
arguments ū of t can be rewritten to some constant values 
c̄, i.e., whether ū↓ = c̄. If additionally the evaluation of the 
top symbol f on c̄ is equal to the evaluation of s, then the 
algorithm tries to construct a proof for equalities ū ≈ c̄ using 
a recursive call. If it finds a proof, then t ≈ s is marked 
proven and ⊤ is returned. Failing this, the algorithm applies 
the main proof tactic, which checks whether there is a rule r 
in rewrite rule database R whose conclusion's left-hand side u 
matches t under some substitution σ. In this case, it calls itself 
recursively, attempting to prove that: (i) the instantiated 
right-hand side σ(v) of the rule is equivalent to s; and (ii) each 
premise of that rule holds in the theory under the same 
substitution. If both of these checks 
succeed, t ≈ s is marked as proven by rule r. Note that the 
matching does not automatically take into consideration the 
commutativity of operators. Instead, the algorithm relies on 
the commutativity of operators being expressed as additional 
rewrite rules. 
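
The search can be sketched compactly in Python. The following is a simplified model of rc from Figure 3 under stated assumptions, not the solver's implementation: terms are nested tuples, string literals carry their quotes, pattern variables start with "?", the evaluator only folds substr over literals, and the global-rewriter oracle of line 6 and the ceval tactic are omitted. All function names are hypothetical. 

```python
FAIL = "fail"

def is_lit(x):
    return isinstance(x, str) and len(x) >= 2 and x[0] == x[-1] == '"'

def evaluate(t):
    """Constant-fold substr over string literals; other terms are kept."""
    if isinstance(t, tuple):
        t = (t[0],) + tuple(evaluate(a) for a in t[1:])
        if (t[0] == "substr" and is_lit(t[1])
                and isinstance(t[2], int) and isinstance(t[3], int)):
            s, i, n = t[1][1:-1], t[2], t[3]
            return '"' + (s[i:i + n] if 0 <= i < len(s) and n > 0 else "") + '"'
    return t

def match(pat, term, subst):
    if isinstance(pat, str) and pat.startswith("?"):
        if pat in subst:
            return subst if subst[pat] == term else None
        return {**subst, pat: term}
    if isinstance(pat, tuple) and isinstance(term, tuple) and len(pat) == len(term):
        for p, t in zip(pat, term):
            subst = match(p, t, subst)
            if subst is None:
                return None
        return subst
    return subst if pat == term else None

def apply(subst, t):
    if isinstance(t, tuple):
        return tuple(apply(subst, a) for a in t)
    return subst.get(t, t) if isinstance(t, str) else t

def rc(t, s, d, rules, sketch):
    """Return True if a proof sketch for t = s is found within depth d."""
    if d < 0:
        return False
    key = (t, s)
    if key in sketch:
        entry = sketch[key]
        if isinstance(entry, tuple) and entry[0] == FAIL:
            if d <= entry[1]:
                return False
        else:
            return True
    if evaluate(t) == evaluate(s):
        sketch[key] = "eval"
        return True
    sketch[key] = (FAIL, d)  # provisional, guards against cyclic premises
    if (isinstance(t, tuple) and isinstance(s, tuple)
            and t[0] == s[0] and len(t) == len(s)
            and all(rc(u, v, d, rules, sketch) for u, v in zip(t[1:], s[1:]))):
        sketch[key] = "cong"
        return True
    for name, premises, lhs, rhs in rules:
        subst = match(lhs, t, {})
        if subst is None:
            continue
        if rc(apply(subst, rhs), s, d - 1, rules, sketch) and all(
                rc(apply(subst, p), apply(subst, q), d - 1, rules, sketch)
                for p, q in premises):
            sketch[key] = name
            return True
    return False

# A conditional rule: if s = "" then substr(s, m, n) rewrites to "".
rules = [("substr-empty-s", [("?s", '""')], ("substr", "?s", "?m", "?n"), '""')]
inner = ("substr", '"abc"', 4, 1)
outer = ("substr", inner, "j", "k")
sketch = {}
found = rc(outer, '""', 3, rules, sketch)
```

With depth 3 the call succeeds: the premise and the instantiated right-hand side are both closed by evaluation, and the input equality is recorded under the rule name. With depth 0 the same call fails, because both recursive calls of the rule tactic run with a negative depth. 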


Database Implementation. The algorithm is implemented 
by using a discrimination tree data structure to index the 
conclusions of all rules in R. When a rule is added to R, it is 
normalized so that its variables are taken from a global list and 
assigned based on a left-to-right traversal of the conclusion. 
For example, x + y ≈ y + x is normalized to x1 + x2 ≈ x2 + x1, 
where the global list of integer variables is (x1, x2, ...). We 
enumerate matches for t ≈ s based on a single traversal of 
the discrimination tree, which both constructs the matching 
substitution and ends at the rewrite rule identifier. 
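
The variable normalization step can be sketched as follows in Python. The sketch assumes terms as nested tuples and a single global variable list x1, x2, ... shared by all sorts, which simplifies the per-sort lists mentioned above; the function name is hypothetical. 

```python
def normalize_vars(term, variables):
    """Rename the variables of `term` to x1, x2, ... in order of their
    first occurrence in a left-to-right traversal."""
    renaming = {}

    def walk(t):
        if isinstance(t, tuple):
            return tuple(walk(a) for a in t)
        if t in variables:
            if t not in renaming:
                renaming[t] = f"x{len(renaming) + 1}"
            return renaming[t]
        return t

    return walk(term)


# x + y = y + x becomes x1 + x2 = x2 + x1.
rule = ("=", ("+", "x", "y"), ("+", "y", "x"))
normalized = normalize_vars(rule, {"x", "y"})
```

Two rules that differ only in the names of their variables map to the same normalized conclusion, so they share a path in the index. 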


Optimizations and Extensions. Our actual algorithm includes 
several optimizations and extensions not shown in 
Figure 3. First, our tactics use a fast failure heuristic that 
avoids making recursive calls for a set of equalities ū ≈ v̄ 
if a single ui ≈ vi can be shown to fail without recursion. 
For example, our congruence tactic for f(u, 0) ≈ f(v, 1) fails 
early since (0 ≈ 1)↓ = ⊥. Second, we extend our techniques 
for evaluation of arithmetic equalities to incorporate polynomial 
normalization, where, for example, the arithmetic term 
y + x + x can be shown to be equal to 2 × x + y. Third, we 
use additional built-in tactics for Booleans, e.g., that prove 
(t ≈ s) ≈ ⊤ if t ≈ s can be proven. Finally, we account 
for fixed-point rules from R (as described in Section III) 
by an extension to the tactic in line 15. In particular, when 
considering a fixed-point rule r with conclusion u ≈ v that 
matches t ≈ s with substitution σ, we immediately check 
if the subterm of σ(v) occurring at the placeholder position 
denoted by r also produces a match using the rule r. If so, we 
store the proof sketch for t ≈ σ(v) and continue this process 
until we have proven the equality t ≈ v′ for some v′. We then 
attempt to prove s ≈ v′ along with the required preconditions 
for the application(s) we used to derive t ≈ v′. 
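
The polynomial normalization mentioned above can be illustrated with a small Python sketch for linear terms. It is an assumption-laden toy, not the normal form used by the solver: only integer constants, variables, n-ary +, and multiplication by an integer constant are supported, and the function names are hypothetical. 

```python
from collections import Counter


def normalize(t):
    """Map a linear term to a Counter from variables to coefficients;
    the integer key 1 holds the constant part."""
    if isinstance(t, int):
        return Counter({1: t})
    if isinstance(t, str):
        return Counter({t: 1})
    op, *args = t
    if op == "+":
        total = Counter()
        for a in args:
            for key, coeff in normalize(a).items():
                total[key] += coeff
        return total
    if op == "*" and len(args) == 2 and isinstance(args[0], int):
        return Counter({k: args[0] * c for k, c in normalize(args[1]).items()})
    raise ValueError(f"unsupported term: {t!r}")


def poly_equal(t, s):
    """Two terms are equal as polynomials if their nonzero coefficients agree."""
    def nonzero(p):
        return {k: c for k, c in p.items() if c != 0}
    return nonzero(normalize(t)) == nonzero(normalize(s))


# y + x + x is equal to 2 * x + y.
same = poly_equal(("+", "y", "x", "x"), ("+", ("*", 2, "x"), "y"))
```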


B. From Proof Sketches to Proofs 


We now return to the question of how to transform a proof 
sketch into a final proof. A proof is built out of proof nodes. A 
proof node is a triple (r, q̄, t̄), where r is a proof rule identifier, 
q̄ is a list of proof nodes, and t̄ is a list of terms. A proof 
checker for a proof rule r is a function taking a list of formulas 
φ̄ and a list of terms t̄, and returning either a conclusion 
formula ψ or failure. Intuitively, the proof checker returns ψ if 
r concludes ψ from premises φ̄ and a side condition depending 
on terms t̄. A well-formed proof in a proof system S is a 
directed acyclic graph over proof nodes whose conclusions can 
be assigned based on the proof checkers for the rules in S. In 
particular, a proof node (r, q̄, t̄) can be assigned a conclusion 
ψ if the proof nodes in q̄ are well-formed with conclusions φ̄ 
and the proof checker for r on (φ̄, t̄) returns ψ. 

Overall, the algorithm in Figure 3 maintains the invariant 
that equalities t ≈ s map to a rule r by the proof sketch P 
only if entries for the preconditions of rule r also have been 
successfully added to P, and moreover these dependencies are 
acyclic. Thus, we can transform the proof sketch P into a 
final proof by first recursively reconstructing the proofs of the 
preconditions to the current rule. For equalities t ≈ s marked 
with the eval rule, we construct a proof whose proof rule is 
reflexivity or evaluation. For equalities f(ū) ≈ f(v̄) marked 
with the cong rule, we first construct proofs for each of ū ≈ v̄, 
and then construct the proof of f(ū) ≈ f(v̄) by congruence. 
For equalities f(ū) ≈ s marked ceval, after reconstructing 
the proofs of ū ≈ c̄ we prove f(ū) ≈ f(c̄) by congruence, 
f(c̄) ≈ s by evaluation, and then f(ū) ≈ s by transitivity 
of these two equalities using the trans rule from Figure 4. 
For equalities t ≈ s marked with a rule r from our database 
having conclusion u ≈ v, we reconstruct the substitution σ 
such that t = σ(u) by matching. We prove t ≈ σ(v) by rule r; 
the proof sketch also contains a proof of σ(v) ≈ s (due to 
the recursive call on line 16), and we finally combine them to 
a proof for t ≈ s by transitivity. 

Example 1: Suppose we wish to prove the correctness of 
the rewrite substr(substr("abc", 4, 1), j, k) ⇝ "". Furthermore, 
assume our rewrite database R contains: 


(define-cond-rule substr-empty-s ( 
    (s String) (m Int) (n Int)) 
  (= s "") (str.substr s m n) "") 


We call the method rc from Figure 3 on the equality 
substr(substr("abc", 4, 1), j, k) ≈ "" with a chosen 
depth d = 3. Assume that the proof sketch map P 
is initially empty. For this input, none of the conditions 
on lines 1-6 apply. On line 7, we provisionally set 
P[substr(substr("abc", 4, 1), j, k) ≈ ""] to (fail, 3). The 
conditions on lines 8 and 11 also do not apply. In the loop 
on line 14, we find that the match term substr(s, m, n) from 
rule substr-empty-s matches the left-hand side of our equality 
with substitution σ = {s ↦ substr("abc", 4, 1), m ↦ j, 
n ↦ k}. On lines 16 and 17, we recursively call rc on 
(σ("") ≈ "", 2) and on (σ(s ≈ ""), 2), respectively. Both 
recursive calls succeed trivially on line 5, where the latter 
equality is substr("abc", 4, 1) ≈ "". Thus, we successfully 
prove the conditions for applying substr-empty-s to our input 
equality. We denote this in P and return ⊤, where P is 
the mapping {"" ≈ "" ↦ eval, substr("abc", 4, 1) ≈ 
"" ↦ eval, substr(substr("abc", 4, 1), j, k) ≈ "" ↦ 
substr-empty-s}. The proof of the original equality can then 
be constructed trivially based on this mapping, where, overall, 
the proof involves an application of substr-empty-s whose 
premise is proven by eval. 
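
The step from a proof sketch to a proof tree for this example can be sketched in Python as follows. The encoding is an assumption made for brevity: instead of re-deriving the substitution by matching, each rule entry of the sketch directly records the instantiated right-hand side and the instantiated premises, and a transitivity step whose second half would be trivial is skipped. The function name build_proof and the node layout (rule, children, conclusion) are hypothetical. 

```python
def build_proof(eq, sketch):
    """Expand the sketch entry for the equality `eq` = (t, s) into a proof
    node (rule, children, conclusion)."""
    t, s = eq
    entry = sketch[eq]
    if entry[0] == "eval":
        return ("eval", [], eq)
    if entry[0] == "cong":
        # entry[1] lists the argument equalities proven pairwise.
        return ("cong", [build_proof(e, sketch) for e in entry[1]], eq)
    rule, mid, premises = entry  # mid is the instantiated right-hand side
    step = (rule, [build_proof(p, sketch) for p in premises], (t, mid))
    if mid == s:
        return step  # the rule already concludes t = s
    return ("trans", [step, build_proof((mid, s), sketch)], eq)


inner = ("substr", '"abc"', 4, 1)
outer = ("substr", inner, "j", "k")
sketch = {
    ('""', '""'): ("eval",),
    (inner, '""'): ("eval",),
    (outer, '""'): ("substr-empty-s", '""', [(inner, '""')]),
}
proof = build_proof((outer, '""'), sketch)
```

The resulting tree is a single application of substr-empty-s whose only child is the eval node for the premise. 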


V. IMPLEMENTATION 


We implemented both a compiler for RARE and the 
reconstruction algorithm, and integrated them with CVC5 [4], a 
state-of-the-art SMT solver, most of which is instrumented to 
produce proofs [6]. Notably, the rewriter is not instrumented, 
so proof reconstruction is an attractive option for CVC5. Our 
initial implementation focuses on the theory of strings, both 
because it is used in practical applications such as reasoning 
about access policies in the cloud [2], and because it presents 
a challenge due to the large number of complex rules in 
the strings theory rewriter, which are required to achieve 
good performance [21]. The theory of strings is frequently 
combined with the theory of linear integer arithmetic to reason 
about the length and indices of strings. Thus, reconstructing 
rewrite proofs for string problems requires reasoning about 
Boolean, linear integer arithmetic, and string terms. None 
of these theories require parameterized sorts, so the current 
implementation uses concrete types. Supporting rewrite rules 
with partially specified types is left for future work. 

In the following, we discuss the integration of our approach 
in the existing proof infrastructure and our experience using 
RARE to define a set of rewrite rules. We implemented our 
reconstruction algorithm as a module in the existing proof 
infrastructure of CVC5. At compile-time, our compiler for 
RARE populates the rewrite rule database (referred to as R 
in the previous section). As mentioned earlier, RARE aims at 
being a compromise between succinctness and expressiveness. 
The limited expressiveness of RARE means that some desirable 
rewrite rules cannot be expressed in it. To overcome this 
limitation, our reconstruction module supports mixing RARE 
rules with rules implemented in C++. We use this feature, for 
example, for certain integer arithmetic rewrites, as discussed 
below. Reconstructing the proofs for rewrites happens during 
post-processing of the overall proof. If a proof for a given 
atomic rewrite cannot be reconstructed, a generic theory 
rewrite proof rule is used instead. 

The proof module of CVC5 supports the production of 
proof certificates in different proof formats. One of the proof 
formats that is well-supported is LFSC [24]. Proofs in LFSC 
use the same language to define both the proof rules and the 
proofs themselves. As part of our implementation, we extended 
CVC5's LFSC back end to automatically generate LFSC proof 
rules for each rewrite that appears in a given proof. 

The string theory rewriter in CVC5 is complex—its imple- 
mentation, not including any of the helper functions, amounts 
to over 3,000 lines of C++ code and distinguishes over 200 
different rewrite rules. Moreover, not all of those rules can 
be expressed as a single rewrite rule in RARE. In view of 
these difficulties, we took a pragmatic approach to proof 
reconstruction for the theory of strings: instead of trying to 
implement all of the rewrite rules in RARE, we focused on 
a set of challenging string benchmarks (see Section VI) of 
practical interest, and then defined rules on demand to fill in 
missing subproofs. We ended up with 40 RARE rules for the 
theory of strings. 

The structure of the CVC5 theory rewriter for arithmetic, on 
the other hand, is quite different. Instead of a large number 
of different rewrite rules, most of the rewriting boils down to 
normalizing polynomials. Thus, for normalizing polynomials 
we implemented a single rule, which is complemented with 
25 rules for arithmetic that do not concern this normalization. 

Finally, the rewriter for Booleans is far simpler than rewrit- 
ers for other theories—its implementation is less than 350 
lines of C++ code. For reconstructing Boolean rewrite rules, 
we took a similar approach to the one for string rewrites and 
defined RARE rules on demand to fill in missing subproofs on 
problems of interest. This led to 22 Boolean rules in RARE. 

While using RARE is not possible or desirable for all rewrite 
rules, it did enable us to iterate quickly to cover the majority 
of missing subproofs for our target benchmarks. 


VI. EVALUATION 


Using our implementation in CVC5, we evaluated the fol- 
lowing research questions: 
• Can we generate fine-grained proofs for rewrites? 
• What is the performance impact of generating 
  fine-grained proofs? 


We considered two benchmark sets, both over the theory 
of strings. The first consists of 25 unsatisfiable industrial 
benchmarks that are representative of challenging queries in 
a specific production environment. The second set consists of 
26,626 unsatisfiable benchmarks from the logics QF_S and 
QF_SLIA in the SMT-LIB benchmark library. To determine 
the set of unsatisfiable benchmarks, we used the results from 
an artifact [3] of an earlier evaluation of CVC5, which ran the 
competition configuration of CVC5 for 1200s. 

For the evaluation, we ran all benchmarks with three 
configurations of CVC5: CVC5, which does not generate any 
proofs; CVC5-C, which generates proofs with coarse-grained 
steps for rewrites; and CVC5-F, which uses our approach 
to generate fine-grained proofs for rewrites. For the proof 
reconstruction, we set the depth d to 3. The configurations 
involved in our evaluation are all variants of CVC5 since to the 
best of our knowledge, no other SMT solvers generate proofs 
for nontrivial theory rewrites. In particular, no other solver can 
generate fine-grained proofs for the theory of strings. 

We ran all experiments on a cluster equipped with Intel 
Xeon E5-2620 v4 CPUs running Ubuntu 16.04. We allocated 
one physical CPU core and 8GB of RAM for each solver- 
benchmark pair and used a 900 seconds time limit. 

To measure the effectiveness of our reconstruction, we 
analyzed the generated proofs of benchmarks that were solved 
by all configurations. The proofs for the industrial benchmarks 
contain 43,819 rewrite steps, and the proofs for the SMT-LIB 
benchmarks contain 2,806,761. For those steps, CVC5-F 
reconstructed fine-grained proofs in terms of our current rewrite 
rule database for 95% of the rewrite steps for the industrial 
set, and for 92% of the rewrite steps for SMT-LIB. The lower 
rate in SMT-LIB can be explained by our greater focus on the 
rewrite steps from proofs of the industrial benchmarks. We 
expect that the SMT-LIB rate can be improved to the level of 
the industrial set without significant challenges, i.e., primarily 
by adding more rules to the rewrite rule database. We also 
note that for 20% (5 out of 25) of the benchmarks in the industrial 
set, CVC5-F manages to produce fine-grained proofs for all 
rewrites, whereas for SMT-LIB, 22% of CVC5-F's proofs with 
rewrite steps (5,945 out of 26,418) are fully fine-grained. 


TABLE I: Number of solved benchmarks and cumulative 
solving times in seconds on commonly solved benchmarks, 
with the slowdown versus CVC5-C in parentheses. 

Division                     CVC5     CVC5-C   CVC5-F 
Industrial (25)     Solved   25       25       25 
                    Time     238      715      779 (1.09x) 
SMT-LIB (26,626)    Solved   26,615   26,614   26,609 
                    Time     34,028   35,932   114,330 (3.18x) 
Total (26,651)      Solved   26,640   26,639   26,634 
                    Time     34,266   36,647   115,109 (3.14x) 


Table I summarizes the overhead incurred by our approach 
grouped by benchmark set. Figure 6 shows a cactus plot that 
compares the performance of the different configurations. In 
this experiment, we use CVC5 as a reference point to measure 
the general overhead of proof production, and to compare 
that overhead with the additional overhead of generating 
fine-grained proofs. Table I shows that the overhead on the 
industrial benchmarks for generating proofs is significant, but 
the additional overhead of generating the fine-grained proofs 
is negligible. For the benchmarks from SMT-LIB, the oppo- 
site is the case: the overhead for generating coarse-grained 
proofs is relatively small, but the overhead of generating fine- 
grained proofs is significant. For a better understanding of 
the origin of the overhead, we provide three scatter plots in 
Figure 5. Figure 5a compares the performance of CVC5-C with 
the performance of CVC5-F and shows that for benchmarks 
that are solved quickly with CVC5-C, there are cases where 
the overhead of the proof reconstruction is significant. For 
longer running benchmarks, the overhead seems to be less 
pronounced. In Figure 5b, we plot the solving time in rela- 
tionship with the relative number of atomic rewrites in proofs 
generated by CVC5-C. The plot shows that atomic rewrites 
are featured more prominently in proofs of benchmarks that 
are solved quickly. This may explain part of the overhead for 
easy benchmarks: a larger portion of the proof is being post- 
processed with the reconstruction algorithm. Finally, Figure 5c 
shows the relationship between the difference in solving time 
between CVC5-F and CVC5-C and the number of atomic 
rewrites. The plot indicates two trends: more atomic rewrites 
lead to more overhead and—more surprisingly—there seems 
to be a large number of benchmarks with a relatively small 
number of rewrites that have a significant amount of overhead. 
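
The slowdown factors reported in Table I can be recomputed from the cumulative times. The following Python sketch does so, assuming the CVC5-F time on the industrial set is 779 seconds, which is the value consistent with the reported totals and with the 1.09x slowdown. The dictionary layout and function name are illustrative only. 

```python
# Cumulative solving times in seconds on commonly solved benchmarks.
times = {
    "Industrial": {"CVC5": 238, "CVC5-C": 715, "CVC5-F": 779},
    "SMT-LIB": {"CVC5": 34028, "CVC5-C": 35932, "CVC5-F": 114330},
}


def slowdown(row, base="CVC5-C", config="CVC5-F"):
    """Ratio of cumulative times, rounded to two decimals."""
    return round(row[config] / row[base], 2)


total = {c: sum(times[d][c] for d in times) for c in ("CVC5", "CVC5-C", "CVC5-F")}

fine_vs_coarse = {d: slowdown(times[d]) for d in times}
fine_vs_coarse["Total"] = slowdown(total)

# Overhead of coarse-grained proof production relative to no proofs.
coarse_vs_none = {d: slowdown(times[d], base="CVC5", config="CVC5-C") for d in times}
```

The second dictionary makes the contrast in the text concrete: on the industrial set most of the cost comes from producing proofs at all (about 3x over CVC5), while on SMT-LIB coarse-grained proofs cost about 6% and the fine-grained reconstruction dominates. 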


Overall, we find that our approach does not significantly 
affect the number of solved benchmarks. Additionally, it works 
well for the industrial use case that we originally targeted with 
our approach. Some of the SMT-LIB benchmarks, on the other 
hand, make use of complex rewrites such as the ones described 
in earlier work [21], which we have not explicitly optimized 
our current implementation for. 


Fig. 5: Scatter plots that analyze the overhead of our rewrite proof reconstruction. 


Fig. 6: Cactus plot that shows the general performance impact 
of generating proofs and the performance impact of generating 
fine-grained proofs for rewrites. 


VII. CONCLUSION 


We presented a DSL-based approach for reconstructing 
fine-grained proofs of rewrite rules. For the future, we plan 
to expand our implementation to other theories in CVC5, 
including theories with parameterized sorts, which will require 
adding support for gradual typing. The DSL proposed in this 
work is independent of the discussed use case and can be used 
to express rewrite rules for SMT solvers in other contexts. 

Another direction for future work is to expand the DSL 
compiler to generate efficient code to replace parts of existing 
theory rewriters, i.e., code that actually performs the rewrites. 
This could make it much easier to explore different sets of 
rewrite rules. It would also make the rewriting code easier 
to understand and maintain. However, since the rewriter is 
called frequently during solving, its performance is critical. 
Therefore, integrating automatically generated code needs to 
be done carefully. Our primary targets in that context are the 
theories of fixed-size bit-vectors and floating-point arithmetic. 

Another back end for the DSL could be used to generate 
verification conditions for the T-validity of rewrite rules. 
These conditions could be discharged using a third-party tool 
such as a proof assistant or another SMT solver. An interesting 
challenge here is that SMT solvers generally only support 
reasoning about fixed-size bit-vectors, whereas rewrite rules 
for the theory of bit-vectors are parameterized by the bit-width. 
We plan to explore approaches for bit-width independent ver- 
ification (e.g., [18]) to discharge these verification conditions. 
Abstract—Satisfiability Modulo Theory (SMT) solvers and 
equality saturation engines must generate proof certificates from 
e-graph-based congruence closure procedures to enable verifi- 
cation and conflict clause generation. Smaller proof certificates 
speed up these activities. Though the problem of generating 
proofs of minimal size is known to be NP-complete, existing 
proof minimization algorithms for congruence closure generate 
unnecessarily large proofs and introduce asymptotic overhead 
over the core congruence closure procedure. In this paper, we 
introduce an O(n^5) time algorithm which generates optimal 
proofs under a new relaxed “proof tree size” metric that 
directly bounds proof size. We then relax this approach further 
to a practical O(nlog(n)) greedy algorithm which generates 
small proofs with no asymptotic overhead. We implemented our 
techniques in the egg equality saturation toolkit, yielding the first 
certifying equality saturation engine. We show that our greedy 
approach in egg quickly generates substantially smaller proofs 
than the state-of-the-art Z3 SMT solver on a corpus of 3760 
benchmarks. 


I. INTRODUCTION 


Congruence closure procedures based on e-graphs [1] are 
a central component of equality saturation engines [2], [3] 
and SMT solvers [4], [5]. Sophisticated optimizations like 
deferred congruence [3] and incremental e-matching [6] make 
such tools faster, but also make guaranteeing correctness more 
difficult [7], [8]. 

Engineers sidestep the challenge of directly verifying high- 
performance congruence implementations by instead extend- 
ing procedures to generate proof certificates [8], [9]. Proof 
certificates provide the sequence of equalities that the congru- 
ence procedure used to establish that two terms are equivalent. 
Clients can safely use results from an untrusted procedure by 
checking its proofs. For example, several proof assistants adopt 
this strategy to provide “hammer tactics” [10] which dispatch 
proof obligations to SMT solvers and then reconstruct the 
resulting SMT proofs back into the proof assistant’s logic, 
thus improving automation without trusting solver implemen- 
tations. 

Proof size can be especially important when extending 
existing verification tools with untrusted solvers. For example, 
in a case study on six Intel-provided Register Transfer Level 
(RTL) circuit design benchmarks [11], an untrusted equality 
saturation engine took under 1 minute to optimize, but the 
existing verification tool took 4.7 hours to replay and check the 
large proof certificates generated by existing techniques [9]. 
Unfortunately, finding proofs of minimal size is an NP- 
complete problem [12]. 

In this paper, we explore efficient generation of small proof 
certificates for e-graph-based congruence procedures. We first 
introduce the problem of finding minimal size proofs for con- 
gruence closure procedures. We define the space of admissible 
proofs and give an integer linear programming formulation for 
finding a proof with minimal size. Next, we introduce a relaxed 
metric called proof tree size, which directly bounds the size of 
the proof, and develop TreeOpt, an O(n^5) time algorithm for 
finding a proof with minimal proof tree size. Unfortunately, 
the O(n^5) algorithm is still too expensive for practical use, 
since congruence closure procedures often consider thousands 
of equations. Thus we also developed an O(n log(n)) time 
greedy approach using subproof size estimates. Our algorithm 
incurs no asymptotic overhead relative to congruence closure 
and finds small proofs in practice. 

We evaluate our approach by implementing both proof gen- 
eration and greedy proof minimization in the state-of-the-art 
egg equality saturation toolkit [3], yielding the first certifying 
equality saturation engine. We compare our greedy algorithm 
against the state-of-the-art SMT solver Z3, which performs 
proof reduction (see Section II) to find smaller proofs. Where 
we can run Z3 (Z3 times out in 5.0% of cases), our proofs 
are only 72.8% as big as Z3’s on average (15.0% in the best 
case). Our proofs are also only 107.8% as big as TreeOpt’s on 
average, compared to 147.6% for Z3. Using our greedy proof 
minimizer, we were able to reduce proof replaying time in 
the Intel-provided RTL verification case study from 4.7 hours 
down to 2.3 hours. 

In this paper, we first define the problem of finding the 
minimal proof and provide an ILP formulation (Section III). 
We then introduce the proof tree size metric and an optimal 
O(n^5) time algorithm for finding proofs of minimal tree size 
(Section IV). Finally, we demonstrate a practical greedy algo- 
rithm for finding proofs of small tree size with no asymptotic 
overhead (Section V). 


II. BACKGROUND AND RELATED WORK 
Congruence is the property that a = b implies f(a) = f(b). 
Congruence closure refers to building a model of a set of 
equalities that satisfies congruence; these models can be used 
for determining whether other equalities are true (as is com- 
mon in SMT solvers) or for finding new equivalent forms of 
Fig. 1: An e-graph model of the equalities a+0 = a and 2+2 = 
4 and the expression f(a+0, g(a+0,2+2)). Note that the top 
e-class contains both the expression f(a +0, g(a +0,2 + 2)) 
and the expression f(a, g(a,4)), which proves that these two 
expressions are equal modulo the equalities. 


an expression (as is common in equality saturation engines). 
For example, consider the equalities a+ 0 = a and 2+2 = 4; 
a model of these two equalities should permit queries like 
whether f(a+0, g(a+0,2+2)) has a simpler form or whether 
it is equal to f(a, g(a, 4)). 

A congruence closure model is typically represented as 
an e-graph, which is a collection of e-nodes and e-classes.¹ 
Each e-node represents a single function application whose 
arguments are e-classes; each e-class, meanwhile, is a 
set of equivalent e-nodes. Any expression can be inserted 
into the e-graph by converting it recursively into e-nodes, 
while equalities can be added into the e-graph by merging 
the e-classes for the equality’s left- and right-hand sides. For 
example, given the equalities a+0 = a and 2+2 = 4, one can 
determine whether f(a+0, g(a+0,2+2)) = f(a, g(a, 4)) by 
inserting these two expressions into an e-graph and then adding 
the two equalities. The resulting e-graph is shown in Figure 1. 
The two expressions end up in the same e-class, so they have 
been proven to be equal. 
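
The running example can be replayed with a deliberately naive congruence closure. The sketch below is only an illustration under simplifying assumptions: terms are nested Python tuples, the helper name congruence_closure is hypothetical, and congruence is propagated by a quadratic fixed-point loop rather than by the optimized e-graph data structures cited above.

```python
from itertools import combinations

def congruence_closure(terms, equalities):
    """Naive congruence closure over nested-tuple terms; returns a find function."""
    # A term is a string (constant) or a tuple (op, arg1, ..., argN).
    universe = set()

    def collect(t):
        universe.add(t)
        if isinstance(t, tuple):
            for arg in t[1:]:
                collect(arg)

    for t in terms:
        collect(t)
    for lhs, rhs in equalities:
        collect(lhs)
        collect(rhs)

    parent = {t: t for t in universe}

    def find(t):
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    def union(a, b):
        parent[find(a)] = find(b)

    for lhs, rhs in equalities:
        union(lhs, rhs)

    # Merge applications with the same operator and pairwise-equal arguments
    # until no more classes can be merged.
    changed = True
    while changed:
        changed = False
        apps = [t for t in universe if isinstance(t, tuple)]
        for s, t in combinations(apps, 2):
            if (find(s) != find(t) and s[0] == t[0] and len(s) == len(t)
                    and all(find(x) == find(y) for x, y in zip(s[1:], t[1:]))):
                union(s, t)
                changed = True
    return find

lhs = ("f", ("+", "a", "0"), ("g", ("+", "a", "0"), ("+", "2", "2")))
rhs = ("f", "a", ("g", "a", "4"))
find = congruence_closure(
    [lhs, rhs],
    [(("+", "a", "0"), "a"), (("+", "2", "2"), "4")],
)
print(find(lhs) == find(rhs))  # the two expressions share an equivalence class
```

This version answers the query but records no justification for any merge, which is exactly the gap that proof certificates fill.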

Congruence procedures must handle queries quickly, with 
tens or hundreds of thousands of equalities. The large number 
of equalities means that e-graphs can contain hundreds of 
thousands or even millions of e-nodes, with the resulting 
e-graph taking significant time to construct. A substantial 
literature [3], [6], [13] describes numerous optimizations to 
e-graphs. Past work shows that an e-graph for n equalities 
can be constructed in O(n log n) time [14]. 


Congruence Proofs. Proof certificates for e-graphs allow 
checking that two terms are equal without reconstructing the 
e-graph. Instead, for an equality E1 = E2 witnessed by the 
e-graph, a proof certificate is a list of given equalities that 
can be applied in order, one after another, as rewrite rules to 
transform E1 into E2. Some of these equalities are applied at 
the root of the expression being rewritten, while others apply 
to subexpressions (via congruence). In our running example, 
we can prove f(a + 0, g(a + 0, 2 + 2)) = f(a, g(a, 4)) as 
follows: 


    f(a + 0, g(a + 0, 2 + 2)) 
      → f(a, g(a + 0, 2 + 2))    by a + 0 = a 
      → f(a, g(a + 0, 4))        by 2 + 2 = 4 
      → f(a, g(a, 4))            by a + 0 = a 


Note that some equalities may be reused, as in this example. 


¹Depending on the author, the “e” in “e-graph” can stand for “expression”, 
“equivalence”, or “equality”. 

Over time, proof certificates have grown increasingly impor- 
tant. In SMT solvers, proof certificates correspond to conflict 
clauses and enable non-chronological backtracking, a key 
component of modern SMT solvers [15]. In proof automation, 
proof certificates bridge foundational logics and unverified 
automated theorem provers, as in the “hammer” style of proof 
tactics [10]. In equality saturation engines, replaying proof 
certificates enables the combination of slow verification 
procedures with fast equality saturation engines. 

To produce proof certificates, e-graph implementations 
maintain a spanning tree for each e-class, with each edge of the 
tree justifying the equality of the two e-nodes it connects [16]. 
This justification is either one of the (quantifier-free) equalities 
provided as input or a congruence edge that refers to other 
connected nodes in the tree. This spanning tree is maintained 
alongside the union-find structure used for efficiently merging 
e-classes, so there is no algorithmic overhead to maintaining it. 
Producing a proof for the equality of two e-nodes in the same 
e-class is then a simple recursive procedure which traverses 
the path between two e-nodes, recursively finding subproofs 
for each congruence edge. In a spanning tree, there is a unique 
path between any two e-nodes, so this recursive algorithm is 
quite fast, taking O(n log n) time for n equalities. 


Shrinking Congruence Proofs. Most uses of proof certifi- 
cates, including generating conflict clauses and replaying and 
checking proofs, take longer as more unique equalities are 
used in the proof certificate. The standard approach to finding 
smaller proof certificates, implemented in SMT solvers such as 
Z3 [5], is based on the observation [16] that proof certificates 
can contain redundant equations; for example, if the given 
equalities include a = b, a = c, and b = c, a proof 
certificate may include all three. By attempting to re-prove the 
same equation while excluding one of the equalities, a proof 
certificate can thereby be shrunk. If the initial proof certificate 
has length k, this proof reduction procedure takes O(k^2 log k) 
time (as checking the validity of each new proof takes O(k log k) 
time using an e-graph). 
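
The reduction loop can be sketched in a few lines. This is a simplified illustration, not Z3's implementation: the names proves and reduce_proof are hypothetical, and the validity check uses a plain union-find over constants, so congruence is ignored.

```python
def proves(equalities, goal):
    """Check whether `equalities` connect both sides of `goal` (no congruence)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in equalities:
        parent[find(a)] = find(b)
    return find(goal[0]) == find(goal[1])

def reduce_proof(proof, goal):
    """Drop each equality whose removal still leaves a valid proof of `goal`."""
    kept = list(proof)
    for eq in list(proof):
        trial = [e for e in kept if e != eq]
        if proves(trial, goal):  # one re-check per equality in the proof
            kept = trial
    return kept

goal = ("a", "c")
first = reduce_proof([("a", "b"), ("b", "c"), ("a", "c")], goal)
second = reduce_proof([("a", "c"), ("a", "b"), ("b", "c")], goal)
print(len(first), len(second))
```

The two calls receive the same three equalities in a different order and end with proofs of different sizes, which illustrates that the result of this style of reduction depends on the proof it starts from and on the order in which equalities are tried.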

This state-of-the-art algorithm is limited in two ways. 
First, when k ∈ ω(√n), it introduces an asymptotic slowdown 
over the rest of the congruence closure algorithm, which 
can answer queries and generate proofs in O(n log n) time 
(where n is the number of equalities). Second and more 
importantly, proof reduction is ultimately limited by the choice 
of the proof to reduce. Since proof reduction is too slow to 
consider the entire e-graph, a valid initial proof is generated 
before applying proof reduction, discarding many (potentially 
useful) equalities right away. This means that, while it results 
in shorter proof certificates, those proof certificates are still 
longer than optimal. This paper addresses both concerns. 


III. OPTIMAL DAG SIZE 


Because proof certificates often contain repeated subproofs, 
we propose a measure for a proof’s size in terms of the number 
of unique equalities it uses. We call this measure DAG size 
because equalities may be reused in the proof. DAG size is 
also the same as the size of a conflict set in the context of SMT 
solvers. The problem of finding a proof of minimal DAG size 
is also NP-complete [12]. This section formalizes a DAG size 
measure of proof length which accounts for subproof reuse, 
and gives an ILP formulation for finding the proof of optimal 
DAG size. 


A. C-graphs 


Traditionally, each equivalence class in an e-graph is rep- 
resented by a spanning tree. Each edge in the spanning tree 
is either a single equality between two terms or equality via 
congruence. Any additional equalities between nodes already 
connected are discarded, since there is already a way to prove 
the two terms are equal. However, these equalities may enable 
a significantly smaller proof. For example, an e-graph can be 
constructed from the equalities a = b, b = c, and a = c. 
The e-graph constructs a spanning tree with edges a = b and 
b = c, discarding a = c. Now the e-graph only admits a proof 
between a and c of size 2, even though the discarded equality 
a = c would prove it in a single step. 
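
The effect of the discarded equality can be checked with a plain breadth-first search. The snippet is a minimal sketch that ignores congruence; shortest_path_len is a hypothetical helper, and the two edge lists encode the three-equality example above.

```python
from collections import deque

def shortest_path_len(edges, s, t):
    """Breadth-first search; number of edges on a shortest s-t path, or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    seen, queue = {s}, deque([(s, 0)])
    while queue:
        node, d = queue.popleft()
        if node == t:
            return d
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

spanning_tree = [("a", "b"), ("b", "c")]        # a = c was discarded
all_equalities = spanning_tree + [("a", "c")]   # the redundant edge is kept

print(shortest_path_len(spanning_tree, "a", "c"))
print(shortest_path_len(all_equalities, "a", "c"))
```

With only the spanning tree the proof of a = c needs two equalities; once every given equality is stored as an edge, a one-step proof becomes available.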

Since these additional equalities can be used to produce 
shorter proofs, our algorithm requires storing them. We call 
the resulting structure a c-graph, which maintains a graph, not 
a spanning tree, for each equivalence class. Storing these ad- 
ditional edges merely requires recording information on every 
e-graph merge operation, so a c-graph can be substituted 
directly for an e-graph without changing the complexity of 
the congruence closure algorithm. In practice, 
a c-graph uses the same representation and algorithms as an 
e-graph, but additionally has an adjacency list for each node 
storing this graph of equalities. In the context of producing 
proofs, we define a simple version of a c-graph below: 


Definition 1. A c-graph is an undirected graph G = (V, E), 
where nodes V represent expressions and edges E represent 
equalities, along with a justification j(e) for each edge e. A 
justification is either an equality v1 = v2 between the vertices 
or a congruence subproof c1 = c2, where ci is a child of vi. 


For convenience, we write C for the set of congruence edges 
in E. An edge justified by an equality connects the left and 
right-hand sides of the equality directly, while an edge justified 
by a congruence c1 = c2 connects terms which are equal 
by congruence over c1 and c2 (e.g. f(c1) and f(c2)). If two 
terms are equal due to the congruence of multiple children, the 
c-graph contains one congruence edge per argument (one per 
child). This keeps the encoding simple, as each congruence 
edge corresponds to one proof of congruence. All functions 
have a bounded arity, so this transformation does not affect 
complexity results. 

Fig. 2: A c-graph proof that a + 0 + 0 = a. There is one 
congruence edge (v0, v1) with j((v0, v1)) = (v1, v2). Since 
v0 and v2 are e-connected, the proof holds. 

For a c-graph to be a valid proof, all congruence edges must 
refer to e-connected nodes: 


Definition 2. A congruence edge e ∈ E with j(e) = (c1 = c2) 
is valid if the congruent children c1 and c2 are e-connected 
in the reduced c-graph (G', j), where G' = (V, E \ {e}). All 
non-congruence edges are valid. 


Definition 3. Two vertices vi and vj are e-connected in a 
c-graph (G, j) if there is a path between them consisting of 
valid edges in E. 


A c-graph then proves s = t if the corresponding vertices 
vs and vt are e-connected. The particular path showing that 
vs and vt are e-connected, along with proofs for each congru- 
ence edge along the path, represents a particular proof. The 
definitions of e-connectedness and edge validity are mutually 
recursive; the base case occurs when two vertices are connected 
by a set of non-congruence edges. 
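
One way to read the mutual recursion of Definitions 2 and 3 is as a least fixed point: start from the non-congruence edges and keep adding congruence edges whose children are already connected by valid edges. The sketch below takes that reading as an assumption; the tuple encoding (u, v, justification) and the helper names are hypothetical.

```python
def connected(edges, s, t):
    """True if s and t are joined by a path that uses only `edges`."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, stack = {s}, [s]
    while stack:
        node = stack.pop()
        if node == t:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def valid_edges(cgraph):
    """cgraph: list of (u, v, just); just is None (given equality)
    or a pair (c1, c2) of congruent children (congruence edge)."""
    valid = {i for i, (_, _, just) in enumerate(cgraph) if just is None}
    changed = True
    while changed:
        changed = False
        for i, (_, _, just) in enumerate(cgraph):
            if i in valid:
                continue
            # Edge i is not valid yet, so it is automatically excluded here.
            others = [cgraph[k][:2] for k in valid]
            if connected(others, *just):
                valid.add(i)
                changed = True
    return valid

def e_connected(cgraph, s, t):
    return connected([cgraph[i][:2] for i in valid_edges(cgraph)], s, t)

# The c-graph of Fig. 2: a + 0 = a is given, (a + 0) + 0 = a + 0 is a congruence.
fig2 = [("a+0", "a", None), ("a+0+0", "a+0", ("a+0", "a"))]
# A congruence edge whose children are not connected is never valid.
dangling = [("f(p)", "f(q)", ("p", "q"))]
print(e_connected(fig2, "a+0+0", "a"), e_connected(dangling, "f(p)", "f(q)"))
```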

The c-graph structure allows for a simple definition of the 
DAG size metric: 


Definition 4. The DAG size of a c-graph (G, j) is |E \ C|, 
the number of non-congruence edges it contains. 


Each non-congruence edge e ∈ E \ C could also be assigned 
a positive, real-numbered weight w(e), giving a weighted DAG 
size: Σ_{e ∈ E \ C} w(e). Applications could leverage these weights 
in order to sample proofs that minimize an alternative objective 
function, such as the run-time of verifying the steps of the 
proof. The algorithms in this paper easily support weighted 
DAG size, but we will use the simpler definition of DAG size 
with each non-congruence edge assigned a weight of 1. 


B. Minimal DAG Size 


The key to finding shorter proofs is to keep track of a 
c-graph of possible proofs during congruence closure, from 
which a short proof can eventually be extracted. Traditional 
congruence closure algorithms store only one proof of equality 
between any two terms (they generate c-graphs shaped like 
forests) because they discard any equalities they discover 
between already-equal terms. Instead, we will store these 
redundant edges, producing a c-graph shaped like a full graph, 


EDGES        S[i, j] ≤ (i, j) ∈ E \ C                              S[i, j] = S[j, i] 
CONGRUENCE   M[i, j, l, r] ≤ (i, j) ∈ E ∧ j((i, j)) = (l = r)       M[i, j, l, r] = M[j, i, r, l] 
PATHS        P[i, i, j] = 0                                        P[i, k, j] ≤ V[i, j] 
             C[i, j] = Σ_k P[i, k, j]                               P[i, k, j] ≤ C[k, j] 
VALIDITY     V[i, j] ≤ S[i, j] + Σ_{l,r} M[i, j, l, r] 
NOCYCLES     0 ≤ D[i, j] ≤ ℓ                                       D[i, j] = 1 if i ≠ j 
             (1 - P[i, k, j]) ℓ + (D[i, j] - D[k, j]) ≥ D[i, k] 
             (1 - M[i, j, l, r]) ℓ + D[i, j] ≥ D[l, r] 
GOAL         C[vs, vt] = 1                                         min Σ_{i,j} S[i, j] 


Fig. 3: An integer linear programming formulation of the minimum DAG size problem. Variables S, M, V, and P are sets 
of boolean variables, while D is integer-valued. Variables are indexed by i, j, and k, which represent nodes in the c-graph. 
Decision variables S and M define which non-congruence and congruence edges of E are selected respectively. ℓ = |C||C|+|E| 
bounds the maximum length of a valid non-cyclic path. 


and will then later search this c-graph for a sub-c-graph of 
minimal size. We will also discover any extra opportunities 
for proofs of congruence between terms, adding these to the 
c-graph as congruence edges. 


Definition 5. Consider a c-graph (G, j), all of whose edges 
are valid. We write (G', j) ⊆ (G, j) when G' ⊆ G and all 
edges in (G', j) are valid. 


The goal is then to find the sub-c-graph of minimal size in 
which two terms s and t remain e-connected. 


Definition 6 (The Minimum DAG size Problem). Given a 
c-graph (G, j) and two e-connected terms s and t, find a 
(G', j) ⊆ (G, j) in which s and t remain e-connected with 
minimal DAG size. 


Note that a sub-c-graph is defined by which edges in G 
it keeps; this allows us to phrase the minimum DAG size 
problem as an integer linear programming problem with one 
decision variable per edge in E. The full linear programming 
problem is given in Figure 3. It defines selected edges via 
S and M, paths P and e-connectedness C (via edge validity 
V), and breaks cycles using distance measure D; it is similar 
to the standard formulation of graph connectedness as an 
ILP problem, except with extra constraints for the validity 
of congruence edges. These constraints require the selected 
edges S and M to form a sub-c-graph of (G, j) with all 
edges valid. Finally, vs and vt are asserted to be e-connected 
to ensure that the sub-c-graph proves s = t and then DAG 
size is minimized. While this ILP formulation is solvable by 
industry-standard ILP solvers for very small instances, the 
minimum DAG size problem is NP-complete in general [12]. 
IV. OPTIMAL TREE SIZE 


What makes the minimal DAG size problem NP-complete 
is the fact that the e-connectedness of multiple congruence 
edges can rely on the same edges. This sharing means that 
the cost of using a congruence edge depends on equalities 
other congruence edges rely on—global information about the 
sub-c-graph of the solution as a whole. Instead of finding 
the optimal solution, we optimize for a different metric to 
achieve a practical algorithm for proof length minimization. 
The distance metric D[i, j] in the ILP formulation, which we 
call the tree size of a c-graph, is an effective metric for this 
purpose. 

The tree size of a c-graph is computed by summing the 
length of the proof, without sharing. Specifically, given a 
c-graph (G, j) that proves s = t, its tree size is the tree size 
of the path from v, to v;: 


Definition 7. Consider a path P that e-connects vi to vj in a 
c-graph. The tree size of P is the number of non-congruence 
edges in P plus, for each congruence edge justified by (vl = 
vr), the tree size of the path from vl to vr. 


If a c-graph has minimal DAG size, its DAG size is the 
number of non-congruence edges in the graph. Its tree size, 
meanwhile, may count each edge more than once, so it presents 
an upper bound on the DAG size.² We can thereby hope that the 
c-graph of minimal tree size will also have a small DAG size. 
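
The gap between the two metrics can be made concrete on the running example f(a + 0, g(a + 0, 2 + 2)) = f(a, g(a, 4)). The snippet is a sketch under an assumed, hypothetical encoding: a proof is a list of steps, where ("eq", name) uses a given equality and ("cong", subproof) is a congruence step justified by a nested proof.

```python
def tree_size(proof):
    """Count every use of a given equality, with no sharing between subproofs."""
    return sum(1 if kind == "eq" else tree_size(body) for kind, body in proof)

def dag_equalities(proof):
    """Collect the distinct given equalities used anywhere in the proof."""
    used = set()
    for kind, body in proof:
        used |= {body} if kind == "eq" else dag_equalities(body)
    return used

# First argument of f: a + 0 = a.
arg1 = [("eq", "a+0=a")]
# Second argument of f: g(a + 0, 2 + 2) = g(a, 4), itself proved by congruence.
arg2 = [("cong", [("eq", "a+0=a")]), ("cong", [("eq", "2+2=4")])]
proof = [("cong", arg1), ("cong", arg2)]

print(tree_size(proof), len(dag_equalities(proof)))
```

The equality a + 0 = a is counted twice by tree size but only once by DAG size, so the tree size (3) bounds the DAG size (2) from above.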


Definition 8 (The Minimum Tree Size Problem). Given a 
c-graph (G, j) that proves s = t, find the (G', j) ⊆ (G, j) 
that proves s = t and has minimal tree size. 


²We chose the names “DAG size” and “tree size” because the relationship 
between these two metrics is similar to the relationship between a DAG and 
a tree containing the same parent-child relationships. 


def optimal_tree_size(start, end): 
    for i in G.vertices: 
        dist[(i, i)] = 0 

    for (l, r) in E \ C: 
        dist[(l, r)] = 1 

    for i in range(|C|): 
        for (l, r) in C: 
            dist[(l, r)] = shortest_path(l, r, dist) 

    return shortest_path(start, end, weights=dist) 


Fig. 4: Pseudocode for the optimal proof tree size algorithm. 
The algorithm keeps a dictionary dist[a, b], the smallest 
tree size of a proof from a to b found so far. 


A. Minimum Proof Tree Size Algorithm 


Unlike DAG size, tree size does not have the problem of 
shared edges. Finding a proof of optimal tree size thus does 
not require global reasoning about the surrounding context: 
using the same edges with another part of the proof does not 
reduce the tree size. As a result, it is possible to solve the 
minimum tree size problem in polynomial time. 

Finding a proof of optimal tree size is not a simple graph 
search. The key problem is that congruence edges may contain 
other congruence edges in their subproofs, and the tree size 
of those subproofs is initially unknown. Moreover, often a 
congruence edge (v1, v2) can be proven in terms of another 
congruence edge (v3, v4) and vice versa. Our algorithm tackles 
this problem by computing the size of proofs of congruence 
bottom up, in multiple passes. At the i-th pass, it constructs 
proofs of equalities between vertices where congruence sub- 
proofs only go i layers deep. These proofs form an upper 
bound on the optimal tree size, decreasing in size until 
the optimal proof is found. When the algorithm reaches a 
fixed point, the proof of optimal tree size is discovered. The 
algorithm for finding the size of the optimal proof is given in 
Figure 4. With more bookkeeping, it can be easily extended 
to yield the specific proof the optimal size corresponds to. 

In each pass, this algorithm computes the shortest path 
for each proof of congruence. Non-congruence edges have 
a weight of 1, and congruence edges are initialized to have 
infinite weight. A fixed point is guaranteed after |C| iterations, 
because each subproof for a congruence edge e cannot use 
the same edge e again (else its tree size would increase). 
The overall running time of the algorithm is bounded by 
O(|C|^2 |E|), with |C|^2 being the number of calls to the 
shortest path algorithm and |E| being the complexity of finding 
a shortest path given the weights. Since there may be n^2 
congruence edges for n nodes in the graph, the overall running 
time is also bounded by O(n^5). However, in practice the 
number of congruence edges is some constant multiple of n, 
and in this case the running time is O(n^3). 
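
The multi-pass scheme of Figure 4 can be turned into a small runnable sketch. Several choices here are assumptions made for illustration: edges are identified by index, the shortest path search is Dijkstra's algorithm, a congruence edge is excluded from the search for its own subproof (following Definition 2), and the example c-graph spells out the intermediate terms of the running example as short vertex names.

```python
import heapq
import math

def shortest_path(adj, weights, s, t, skip=None):
    """Dijkstra over edge ids; minimal total weight from s to t (inf if none)."""
    best = {s: 0}
    heap = [(0, s)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == t:
            return d
        if d > best.get(node, math.inf):
            continue
        for nxt, eid in adj.get(node, ()):
            if eid == skip or weights[eid] == math.inf:
                continue
            nd = d + weights[eid]
            if nd < best.get(nxt, math.inf):
                best[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return math.inf

def optimal_tree_size(equalities, congruences, start, end):
    """equalities: [(u, v)]; congruences: [(u, v, (l, r))], where (l, r)
    are the congruent children that justify the edge between u and v."""
    edges = list(equalities) + [(u, v) for u, v, _ in congruences]
    adj = {}
    for eid, (u, v) in enumerate(edges):
        adj.setdefault(u, []).append((v, eid))
        adj.setdefault(v, []).append((u, eid))
    n_eq = len(equalities)
    # Given equalities cost 1; congruence edges start with unknown (infinite) cost.
    weights = [1] * n_eq + [math.inf] * len(congruences)
    for _ in range(len(congruences)):  # |C| passes suffice for a fixed point
        for k, (_, _, (l, r)) in enumerate(congruences):
            eid = n_eq + k
            weights[eid] = min(weights[eid],
                               shortest_path(adj, weights, l, r, skip=eid))
    return shortest_path(adj, weights, start, end)

# t0..t3 are the four f-terms of the running proof; g1..g3 the three g-terms.
equalities = [("a+0", "a"), ("2+2", "4")]
congruences = [
    ("t0", "t1", ("a+0", "a")),
    ("t1", "t2", ("g1", "g2")),
    ("t2", "t3", ("g2", "g3")),
    ("g1", "g2", ("2+2", "4")),
    ("g2", "g3", ("a+0", "a")),
]
print(optimal_tree_size(equalities, congruences, "t0", "t3"))
```

The nested congruences t1 = t2 and t2 = t3 only receive a finite cost in the second pass, once the g-term edges they depend on have been priced in the first.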


V. GREEDY OPTIMIZATION OF PROOF TREE SIZE 


The optimal algorithm of Section IV finds the proof with 
minimal tree size, but it does so at an unacceptable cost: 
its running time dominates the O(nlogn) running time of 
def greedy(start, end, pf_size_estimates): 
    todo = Queue((start, end)) 
    fuel = T 

    while len(todo) > 0: 
        (start, end) = todo.pop() 
        path = shortest_path(start, end, pf_size_estimates) 
        for edge in path: 
            match edge: 
                congruence(l, r) -> 
                    if fuel > 0: 
                        todo.push((l, r)) 
                        fuel = fuel - 1 
                    else: 
                        add_to_proof(unoptimized_proof(l, r)) 
                axiom(a) -> 
                    add_to_proof(a) 


Fig. 5: Pseudocode for the greedy optimization of proof tree 
size. The algorithm either recurs for congruence edges if fuel 
allows, or it uses the estimates for each congruence edge. 
Unlike TreeOpt, the algorithm is top-down and terminates after 
T steps. 


congruence closure itself [1]. In the context of c-graphs, 
n = |E \ C| is the number of input equalities to congruence closure. 
This section thus proposes a greedy algorithm for proof tree 
size, which reduces tree size and DAG size significantly in 
practice, though it is not optimal with respect to either metric. 


A. Greedy Optimization 


The key insight behind the greedy algorithm is that the 
multiple passes of the optimal algorithm are only necessary to 
compute the minimal cost of congruence edges. If the tree size 
for each congruence edge were known, the proof with optimal 
tree size could be found by a simple shortest path algorithm. 
The greedy algorithm is a simple breadth-first search shortest 
path algorithm that takes estimated costs for congruence edges 
as an input. The closer the estimates are to the proof of optimal 
tree size, the better the results of the greedy algorithm. 

Defer for now the challenge of estimating the tree size for 
each congruence edge, and focus on the greedy algorithm 
itself. The algorithm is simple: use a breadth-first search to 
choose a path from the start vertex s to the end vertex t 
of minimal length, using the estimates for each congruence 
edge. However, those estimates may not be optimal, so the 
algorithm then recurses for each congruence edge. Note the 
difference between the optimal algorithm (which first opti- 
mizes congruence edges) and the greedy algorithm (which 
first finds a shortest path). If the recursion were performed 
until all congruences are optimized, this algorithm would take 
time O(|C|(n + |C|)), which is still too high compared to 
the O(n log(n)) runtime of congruence closure. Instead, only 
T expansions of congruence edges are permitted; in practice, 
we choose T = 10, which seems to work well. After T 
expansions, there may be sub-proofs which have not been 
generated. In this case, the algorithm defaults to a generic 
proof production algorithm for the remaining sub-proofs [16]. 
Figure 5 lists the greedy algorithm. 
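
A runnable reading of Figure 5 is sketched below. It is an illustration under stated assumptions rather than the egg implementation: edges are encoded as (u, v, justification) tuples, the path search is Dijkstra's algorithm over the supplied estimates, the fallback proof producer is passed in as a callback named unoptimized_proof, and the result is a flat list of given equalities.

```python
import heapq
from collections import deque

def shortest_path_edges(adj, cost, s, t):
    """Dijkstra; edge ids on a cheapest s-t path (assumes t is reachable)."""
    best, back = {s: 0}, {}
    heap = [(0, s)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == t:
            break
        if d > best[node]:
            continue
        for nxt, eid in adj.get(node, ()):
            nd = d + cost[eid]
            if nd < best.get(nxt, float("inf")):
                best[nxt], back[nxt] = nd, (node, eid)
                heapq.heappush(heap, (nd, nxt))
    path = []
    while t != s:
        t, eid = back[t]
        path.append(eid)
    return path[::-1]

def greedy(edges, cost, start, end, fuel, unoptimized_proof):
    """edges[eid] = (u, v, just): just is None for a given equality and
    (l, r) for a congruence; cost[eid] is 1 or a size estimate."""
    adj = {}
    for eid, (u, v, _) in enumerate(edges):
        adj.setdefault(u, []).append((v, eid))
        adj.setdefault(v, []).append((u, eid))
    proof, todo = [], deque([(start, end)])
    while todo:
        s, t = todo.popleft()
        for eid in shortest_path_edges(adj, cost, s, t):
            u, v, just = edges[eid]
            if just is None:
                proof.append((u, v))      # given equality: use it directly
            elif fuel > 0:
                todo.append(just)         # expand this congruence later
                fuel -= 1
            else:
                proof.extend(unoptimized_proof(*just))  # out of fuel
    return proof

edges = [("a", "b", None), ("b", "c", None), ("a", "c", None),
         ("f(a)", "f(c)", ("a", "c"))]
cost = [1, 1, 1, 1]
# Hypothetical fallback: the spanning-tree proof that ignores the edge a = c.
fallback = lambda l, r: [("a", "b"), ("b", "c")]

with_fuel = greedy(edges, cost, "f(a)", "f(c)", 10, fallback)
without_fuel = greedy(edges, cost, "f(a)", "f(c)", 0, fallback)
print(with_fuel, without_fuel)
```

With fuel available the congruence is expanded and the one-step proof a = c is found; with no fuel the algorithm falls back to the longer unoptimized proof.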


Fig. 6: An example reduced c-graph with a single congruence 
edge. The root of the tree is the vertex labeled v4 at the top, 
and there is a single congruence edge (v1, v2) in the spanning 
tree. The proof of congruence between vertices v1 and v2 has 
a tree size of two because the proof between the congruent 
children involves two equalities. 


B. Estimating Tree Sizes 


The main challenge to instantiating the greedy algorithm 
is generating size estimates for congruence edges. However, 
there is a simple way to do so: reduce the c-graph to a forest 
with one tree per connected component, in such a way 
that all edges remain valid. Luckily, the traditional congruence 
closure proof production algorithm generates such reduced 
c-graphs by omitting any unions which connect already-equal 
terms. Now, the tree size of a proof of congruence can be 
estimated by directly calculating the tree size of a proof in the 
reduced instance. In such a reduced c-graph, there is only one 
possible path between any two nodes, so the proof is unique. 

Computing the tree sizes of all proofs in the reduced c-graph 
requires some care to stay within the necessary asymptotic 
bounds. First, each tree in the reduced c-graph is arbitrarily rooted. Given 
a vertex a, let size[a] be the size of the proof between a 
and the root of its tree. Then the tree size of the proof between 
any two vertices a and b can be calculated as 

    size[a] + size[b] - 2 × size[lca(a, b)], 

where lca computes the least common ancestor of a and b 
in the tree. The lca function can be pre-computed for all 
relevant proofs in O(n) time using Tarjan’s off-line algorithm 
[17]. 
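
The formula can be checked on a small rooted tree. The sketch below is illustrative only: the tree, the per-edge sizes, and the helper names are made up for the example, and the lowest common ancestor is found by a naive walk to the root instead of Tarjan's off-line algorithm.

```python
def root_distances(parent, edge_size):
    """size[v] = total size of the proof edges on the path from v to the root."""
    size = {}

    def dist(v):
        if v not in size:
            size[v] = 0 if parent[v] == v else edge_size[v] + dist(parent[v])
        return size[v]

    for v in parent:
        dist(v)
    return size

def lca(parent, a, b):
    """Naive lowest common ancestor: walk a to the root, then walk b upward."""
    ancestors = set()
    while True:
        ancestors.add(a)
        if parent[a] == a:
            break
        a = parent[a]
    while b not in ancestors:
        b = parent[b]
    return b

def tree_size_between(parent, size, a, b):
    return size[a] + size[b] - 2 * size[lca(parent, a, b)]

# Root r; x hangs below r; y and z hang below x.
parent = {"r": "r", "x": "r", "y": "x", "z": "x"}
# edge_size[v] is the size of the edge from v to its parent; the edge above y
# stands for a congruence edge whose subproof has size 2.
edge_size = {"x": 1, "y": 2, "z": 1}
size = root_distances(parent, edge_size)
print(tree_size_between(parent, size, "y", "z"))
```

The path from y to z runs through x, so the shared segment from x to the root is subtracted twice and the result is the sum of the two edges below x.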

Figure 7 shows the pseudocode for calculating proof tree 
sizes given the reduced c-graph. To avoid an infinite loop in 
proof length calculation, the algorithm builds each tree in the 
reduced c-graph incrementally using a union-find structure 
(using the parent array). 
Consider the example in Figure 6, in which the path to the 
root node v4 contains a congruence edge. The tree size of the 
proof between nodes v2 and v4, written tree_size(v2, v4), 
involves calculating the size of the congruence proof 
tree_size(v1, v3). So tree_size(v2, v4) cannot be 
computed using v4 as the root of the tree, since the path to 
the root involves the congruence edge. Instead, the algorithm 
uses least common ancestor v2 to compute tree_size(v1, v3). 
Because the proof is e-connected, any congruence edges 
on the path to the least common ancestor can be computed 
recursively without diverging. 


def path_compress(vertex): 
    if parent[vertex] != vertex: 
        path_compress(parent[vertex]) 
        # accumulate the size to the old parent before re-parenting 
        size[vertex] = size[vertex] + size[parent[vertex]] 
        parent[vertex] = parent[parent[vertex]] 

def traverse_to_ancestor(vertex, ancestor): 
    while parent[vertex] != ancestor: 
        edge = parent_edge(parent[vertex], G) 
        parent[edge.start] = edge.end 
        if is_congruence(edge): 
            traverse(j(edge).start, j(edge).end) 
        estimate_size(edge) 
        path_compress(vertex) 

def traverse(start, end): 
    path_compress(start) 
    path_compress(end) 
    ancestor = argmin( 
        (lca(start, end), parent[start], parent[end]), 
        distance_to_root) 
    path_compress(ancestor) 

    # Ensure that start, end, and their lca share a parent 
    traverse_to_ancestor(start, ancestor) 
    traverse_to_ancestor(end, ancestor) 
    estimate_tree_size(start, end) 

def estimate_tree_size(start, end): 
    tree_size[(start, end)] = (size[start] + size[end] 
                               - 2 * size[lca(start, end)]) 

def estimate_size(edge): 
    match edge: 
        congruence(left, right) -> 
            size[edge.start] = tree_size[(left, right)] 
        axiom(a) -> 
            size[edge.start] = 1 

for i in G.vertices: 
    parent[i] = i 
    size[i] = 0 

for (start, end) in congruence_edges(G): 
    traverse(start, end) 


Fig. 7: Pseudocode for computing tree sizes of all congruence 
proofs given the reduced c-graph. The algorithm efficiently 
computes these tree sizes by storing a union-find data structure 
that keeps track of size, the size of the proof between a node 
and its parent. Computing the size of a proof involves traversing 
the proof, updating the union-find whenever the size of a 
sub-proof is discovered. The pseudocode uses the function 
distance_to_root to denote the number of edges from 
a vertex to the root of its tree. It also makes use of lca, a 
function that returns the lowest common ancestor of two vertices. 


Each congruence edge results in at most one recursive call 
to traverse, while non-congruence edges are added to the 
union-find data structure directly. Ultimately, each edge in the 
c-graph contributes at most five union-find operations: three 
find operations at the start of traverse, one union
operation to add it to the union-find data structure, and one 
more find in traverse_to_ancestor. A sequence of 
m operations on a union-find data structure with h nodes can
be executed in O(m log(h)) time [18]. This means the overall 
cost of estimating sizes for congruence edges is O(n log(n)) 
since n bounds both m and h (recall n = |E \ C|). Adding 
Fig. 8: This CDF compares the unoptimized (gray solid),
Z3 (blue dashed), greedy (green dash-dotted), and TreeOpt 
(red dotted) proof generation algorithms on the same 3571 
benchmarks where Z3 does not time out. Each line shows 
the number of benchmarks whose proofs are at most the size 
indicated on the horizontal axis. Our greedy approach (green) 
closely tracks the size of TreeOpt’s (red) proof certificates, 
showing that its certificates are difficult to shrink further. Five 
outliers with an unoptimized DAG size of more than 100 are 
omitted. 


an O(n + |C|) cost for the greedy algorithm itself yields an
overall runtime of O(n log(n) + n + |C|) = O(n log(n) + |C|).
Limiting the number of congruence edges C to a multiple of
n results in an O(n log(n)) runtime, introducing no asymptotic
overhead compared to congruence closure alone.³


VI. EVALUATION 


This section compares an implementation of our greedy 
proof generation algorithm in the egg equality saturation 
toolkit [3] to Z3’s proof generation [19]. As described in 
Section II, Z3 applies proof reduction to the first proof it finds, 
which substantially reduces proof size. Our greedy approach 
instead attempts to extract a minimal proof from the e-graph. 
We found that, even without a proof reduction post-pass, our 
greedy approach can quickly find significantly smaller proofs 
than Z3 (Figure 8). 


A. Comparing egg to Z3 


We use Z3 version 4.8.12 and egg version 0.7.1 compiled 
with Rust 1.51.0. egg is a state-of-the-art equality saturation 
library that implements the rebuilding algorithm for speeding 
up equality saturation workloads. It is used by projects like 
Herbie [20], Ruler [21] and Szalinski [22]. Z3 is a state-of- 
the-art automated theorem prover and is optimized for theorem 
proving workloads. To create a realistic benchmark set, we 
used the Herbie 1.5 numerical program synthesis tool [20]. 
Herbie uses equality saturation for program optimization and 
comes with a standard benchmark suite of programs drawn 
from textbooks, research papers, and open-source software. 
We extracted Herbie’s set of quantified equalities and recorded 
all inputs and outputs from its equality saturation procedure. 


³ In practice, |C| is typically a small constant factor larger than n. We use
a constant factor of 10n as a reasonable limit on the number of congruence
edges.
TABLE I: Data comparing egg to Z3 using different proof
production algorithms: egg with proofs of optimal tree size,
egg with greedy optimization, egg with traditional proof
reduction (see Section II), Z3, and egg without any optimization.
Note that proof reduction’s analysis is in terms of k, the size
of the unoptimized proof, while n is the size of the entire
c-graph instance. In practice, k is often small relative to n.


Algorithm     Size vs. TreeOpt   Avg. Time (ms)   Complexity
TreeOpt       100.0%             1008.60          O(n?)
Greedy        105.9%               39.33          O(n log(n))
Egg Reduc.    138.7%               23.01          O(n log(n) + k? log(k))
Z3            147.3%              130.69          O(n log(n) + k? log(k))
Egg           185.9%               22.15          O(n log(n))


This results in 3760 input/output pairs, of which we focus on
the 3571 where Z3 produced an answer within 2 minutes.

For the Z3 baseline, we converted each input/output pair 
into a satisfiability query by asserting each quantified equality 
(with a trigger for the left-hand side of the equality) and then
asserting that the input and output are not equal. Z3 then 
attempts to prove the input and output are equal using an e- 
graph and the quantified equalities (the theory of uninterpreted 
functions). We then computed the DAG size by counting the 
number of calls to its quant-inst command [23] in its 
proof scripts. We ran egg exactly how it is used by Herbie, 
and then optimized proof length using the greedy algorithm of 
Section V and measured DAG size by counting proof nodes. 
Z3 times out after 2 minutes for 5.0% of the input/output
pairs, and completes in 213.25 milliseconds on average for 
the remainder. egg does not time out, and runs for an average 
of 39.57 milliseconds. To measure DAG size for the resulting 
proofs, we ran both egg and Z3 in proof-producing mode and 
examined the resulting proofs. 
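
The query construction described above can be sketched as SMT-LIB text generation. This is an illustrative sketch only, not the scripts used in the evaluation: the helper names, the uninterpreted sort `U`, and the single commutativity rule are assumptions made for the example, while `forall`, `:pattern`, `produce-proofs`, and `get-proof` are standard SMT-LIB constructs.

```python
def declare(functions, constants, sort="U"):
    """Declare one uninterpreted sort, the function symbols, and the constants."""
    lines = [f"(declare-sort {sort} 0)"]
    for name, arity in functions.items():
        args = " ".join([sort] * arity)
        lines.append(f"(declare-fun {name} ({args}) {sort})")
    lines += [f"(declare-const {c} {sort})" for c in constants]
    return lines

def quantified_equality(variables, lhs, rhs, sort="U"):
    """Assert forall vars. lhs = rhs, with the left-hand side as the trigger."""
    binders = " ".join(f"({v} {sort})" for v in variables)
    return (f"(assert (forall ({binders}) "
            f"(! (= {lhs} {rhs}) :pattern ({lhs}))))")

def build_query(functions, constants, rules, inp, out):
    """Build a query that is unsatisfiable exactly when inp = out follows from the rules."""
    lines = ["(set-option :produce-proofs true)"]
    lines += declare(functions, constants)
    lines += [quantified_equality(vs, lhs, rhs) for vs, lhs, rhs in rules]
    lines.append(f"(assert (not (= {inp} {out})))")
    lines += ["(check-sat)", "(get-proof)"]
    return "\n".join(lines)

# Hypothetical example: one rewrite rule (commutativity of add).
query = build_query(
    functions={"add": 2},
    constants=["x", "y"],
    rules=[(["a", "b"], "(add a b)", "(add b a)")],
    inp="(add x y)",
    out="(add y x)",
)
print(query)
```

A solver that answers `unsat` on such a query has shown the input and output equal under the rules, and the proof it returns is what the DAG size measurement inspects.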

Figure 8 contains the results: the proofs produced by egg 
are 72.8% as big as Z3’s on average, despite Z3’s use of a 
proof reduction algorithm. Moreover, the effect of proof length 
optimization is greater for longer proofs: queries with Z3 DAG 
size over 10 see an average 36.0% reduction, while queries 
with Z3 DAG size over 50 see an average 49.7% reduction. 


B. Detailed Analysis 


In this section, we perform a more detailed ablation study 
comparing egg’s results using different algorithms. We im- 
plement proof reduction for egg and the optimal tree width 
algorithm described in Section IV. The ILP solution is not 
feasible to run, so we use Z3 as a baseline. 

Table I summarizes the results. Z3 and egg are optimized
for different workloads; they therefore use different underlying
congruence closure algorithms and produce different proofs.
Using proof reduction, egg finds slightly shorter proofs than 
Z3. It also performs better than Z3-style proof reduction 
implemented in egg. Using the greedy algorithm, egg finds 
proofs which are even shorter, and which are also quite close 
to proofs of optimal tree size. The data in Table I cover
the 3571 of the 3760 benchmarks on which Z3 did not time out,
the same set used in Figure 8.


TABLE II: RTL design benchmark results. Total runtime includes equality saturation and proof production runtimes but excludes any formal verification time.


             Tree Size                DAG Size                 Runtime (sec)
Benchmark    Orig   Greedy   Reduce   Orig   Greedy   Reduce   Total   Proof    Proof %
Datapath 1    174       90      48%     67       61       9%     BTS     2.58      1%
Datapath 2    561       92      84%     98       46      53%    34.5     2.08      6%
Datapath 3     14       13       7%     13       12       8%    5.13     0.49      9%
Datapath 4   4402      202      95%    223      120      46%    76.4    32.80     43%
Datapath 5    271       95      65%    101       72      29%     105     0.18    0.2%
Datapath 6    155       83      46%     67       49      27%     280   168.00     60%


While we would ideally use the minimal DAG size proofs as 
a baseline in our evaluation, we found the ILP formulation was 
infeasible to run on real queries. However, the O(n^5) TreeOpt
algorithm, which runs in O(n?) time when the number of 
congruences is bounded, performs well enough to run on all 
of the examples. We found that in 81.1% of these cases, the 
greedy algorithm in fact found the proof with optimal tree 
size. Moreover, across all of these benchmarks our greedy 
algorithm’s overall performance closely tracks that of TreeOpt, 
showing that the greedy algorithm’s proof certificates are 
difficult to shrink further. 


C. Case Study 


Typically, proof production is necessary in equality sat- 
uration to perform translation validation. In this case, the 
shorter proofs produced by proof length optimization re- 
duce the number of translation validation steps that must 
be performed and thus result in faster end-to-end results. 
A practical application that benefits from this reduction is 
hardware optimization performed using egg by researchers 
at Intel Corporation [11]. Translation validation is used to 
ensure that the egg-optimized hardware designs are formally
equivalent to the input. Extremely high assurance is needed 
for hardware designs because of the high cost of actual 
hardware manufacturing. For each step in the tree proof, two
Register Transfer Level (RTL) designs are generated, which 
are proven to be formally equivalent by Synopsys HECTOR 
technology, an industrial formal equivalence checking tool. 
The intermediate steps generate a chain of reasoning proving 
the equivalence of the input and optimized designs, necessary 
because the tools can fail to prove equivalence of significantly 
transformed designs. The tree proof is used to ensure that 
HECTOR can prove each step with no user input as it is a 
simpler check than a DAG proof step. 

The results of evaluating this paper’s greedy optimization 
algorithm on six Intel-tested RTL design benchmarks are 
shown in Table II. On average, proof lengths decreased by 
29%, with the best case showing a 53% reduction, while 
proof production took only 34 seconds on average, minuscule
compared to multi-hour translation validation times. Moreover, 
these reductions in proof length resulted in shorter transla- 
tion validation times. The optimized constant multiplication 
hardware design described in Figure 9 was generated by egg,


[Figure: dataflow graph producing the outputs 5a + b, 4a + 2b, 3a + 3b, 2a + 4b, and a + 5b]


Fig. 9: Dataflow graph of an optimized multiple constant
multiplication circuit design generated by egg.


starting from an initial naive implementation. Running the 
complete verification flow for the original and greedy proofs, 
the runtime was reduced from 4.7 hours to 2.3 hours. In 
more complex examples we expect that days of computation 
could be saved. For parameterizable RTL, where a design must 
typically be re-verified for every possible parameterization,
these gains add up quickly. 
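
The summary figures quoted in this case study can be re-derived from the reported numbers. The short check below is a sketch that uses only values stated in the text and in Table II (the DAG size reduction percentages, the proof runtimes, and the 4.7 hour and 2.3 hour verification times); the variable names are illustrative.

```python
# DAG size reductions (percent) and proof production runtimes (seconds),
# one entry per benchmark, Datapath 1 through Datapath 6.
dag_reduction_pct = [9, 53, 8, 46, 29, 27]
proof_runtime_sec = [2.58, 2.08, 0.49, 32.80, 0.18, 168.00]

avg_reduction = sum(dag_reduction_pct) / len(dag_reduction_pct)
best_reduction = max(dag_reduction_pct)
avg_proof_time = sum(proof_runtime_sec) / len(proof_runtime_sec)

# End-to-end verification flow for the Figure 9 design, in hours.
speedup = 4.7 / 2.3

print(f"average DAG reduction: {avg_reduction:.1f}%")
print(f"best DAG reduction:    {best_reduction}%")
print(f"average proof time:    {avg_proof_time:.1f} s")
print(f"verification speedup:  {speedup:.2f}x")
```

The averages round to the 29% reduction and 34 seconds quoted above, and the verification flow is roughly twice as fast with the greedy proofs.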


VII. CONCLUSION AND FUTURE WORK 


This paper examined the problem of finding minimal con- 
gruence proofs from first principles. Since finding the optimal 
solution is infeasible, we introduced a relaxed metric for proof 
size called proof tree size, and gave an O(n^5) algorithm for
optimal solutions in that metric. While the optimal algorithm 
is too expensive in practice, it provides a reasonable base- 
line for small congruence problems, and inspired a practical 
O(nlog(n)) greedy algorithm which generates proofs which 
are 107.8% as big on average. 

We implemented proof generation in the egg equality 
saturation toolkit, making it the first equality saturation engine 
with this capability. Since equality saturation toolkits—unlike 
SMT solvers—support optimization directly, this opens the 
door to certifying the results of much recent work in opti- 
mization and program synthesis [3], [20]-[22], [24]-[26]. 

Looking forward, we are especially eager for the community 
to explore more applications of proof certificates in congru- 
ence closure procedures. For example, it should be possible 
to use proofs to tune rewrite rule application schedules in
e-matching, improve debugging of subtle equality saturation 
issues, and enable equality-saturation-based “hammer” tactics 
in proof assistants. It may also be possible to further improve 
on the greedy proof generation algorithm with better heuristics 
for estimating proof sizes, or to enable more efficient prover 
state serialization via smaller proofs. 
Abstract—With the increasing availability of parallel computing
power, there is a growing focus on parallelizing algorithms 
for important automated reasoning problems such as Boolean 
satisfiability (SAT). Divide-and-Conquer (D&C) is a popular 
parallel SAT solving paradigm that partitions SAT instances 
into independent sub-problems which are then solved in parallel. 
For unsatisfiable instances, state-of-the-art D&C solvers generate 
DRAT refutations for each sub-problem. However, they do not 
generate a single refutation for the original instance. To close 
this gap, we present Proof-Stitch, a procedure for combining 
refutations of different sub-problems into a single refutation for 
the original instance. We prove the correctness of the procedure 
and propose optimizations to reduce the size and checking 
time of the combined refutations by invoking existing trimming 
tools in the proof-combination process. We also provide an 
extensible implementation of the proposed technique. Experiments 
on instances from last year’s SAT competition show that the 
optimized refutations are checkable up to seven times faster than 
unoptimized refutations. 

Index Terms—Parallel SAT, Divide and Conquer, Refutation 
Checking 


I. INTRODUCTION 


Boolean satisfiability (SAT) solvers have improved dramati- 
cally in recent years. They are now regularly used in a wide 
variety of application areas including hardware verification [1], 
computational biology [2] and decision planning [3]. 

With the emergence of cloud-computing and improvements 
in multi-processing hardware, the availability of parallel 
computing power has also increased dramatically. This has 
naturally led to an increased focus on parallelizing important 
algorithms, and SAT is no exception. There are two traditional 
approaches to parallel SAT solving: the Divide-and-Conquer
(D&C) approach [4]-[6] and the portfolio approach [7]. In the 
D&C approach, the original SAT instance is partitioned into 
independent sub-problems to be solved in parallel, while in 
the portfolio approach multiple SAT solvers are independently 
run on the original instance. Although the portfolio approach 
in combination with clause sharing performs well for small 
portfolio sizes, the D&C approach scales better in environments 
with large parallel computing power such as the cloud. Several 
implementations of D&C solvers exist [4]-[6], [8]. Every
implementation uses a divider to split up the original instance
into sub-problems, and a base SAT solver to solve the
independent sub-problems. For example, ggSAT [8] uses
CaDiCaL [9] as its base solver.


If a SAT problem is unsatisfiable, a proof of unsatisfiability 
(or refutation) can be produced and independently checked to 
validate the result. Since 2013, the annual SAT competition 
has required SAT solvers to generate refutations. The most 
commonly supported refutation format today is the DRAT 
format [10]. Existing D&C SAT solvers produce refutations 
for each sub-problem independently. However, even if the 
refutation for each sub-problem passes the proof-checker, this 
is not a formal guarantee that the original instance also admits a 
refutation, as there could have been an error in the partitioning 
strategy. For example, a buggy solver may incompletely 
partition the SAT instance (~41) A (l2 V €3) A (%2 V £3) 
into sub-problems with cubes £; and £2. Both of these 
sub-problems are unsatisfiable, even though the instance is 
satisfiable. Transient errors in the underlying distributed system 
may also cause sub-problem refutations to be truncated or 
missing. To address these challenges, we introduce Proof- 
Stitch, which implements a strategy for combining DRAT 
refutations for sub-problems into a single refutation for the 
original instance, a process we call refutation stitching. Our 
contributions are: 


e We describe an algorithm for combining DRAT refutations 
of partitions of problems into a single refutation for the 
original problem and provide an open-source implementa- 
tion on GitHub [11]. 

e We describe an optimization technique leveraging existing 
trimming tools (e.g., drat-trim [12]) to improve the quality 
of the combined refutations. 

e We evaluate our implementation on benchmarks from 
last year’s SAT competition [13]. Our results show that 
trimmed refutations are checkable up to seven times faster 
than untrimmed refutations. 
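
The incomplete-partition failure mode described in the introduction can be reproduced by brute force on a tiny instance. The sketch below is illustrative only: the three-clause instance and the two cubes are hypothetical values chosen for the example, literals are signed integers in the DIMACS style, and the exhaustive search stands in for a real SAT solver.

```python
from itertools import product

def satisfiable(clauses, num_vars):
    """Exhaustively test every assignment (only viable for tiny instances)."""
    for bits in product([False, True], repeat=num_vars):
        def holds(lit):
            return bits[abs(lit) - 1] == (lit > 0)
        if all(any(holds(lit) for lit in clause) for clause in clauses):
            return True
    return False

# Hypothetical instance: (not l1) and (l2 or l3) and (l2 or not l3).
instance = [[-1], [2, 3], [2, -3]]
# A buggy divider emits only these two cubes: l1, and not l2.
cubes = [[1], [-2]]

# Each sub-problem is the instance plus one unit clause per cube literal.
sub_results = [satisfiable(instance + [[lit] for lit in cube], 3)
               for cube in cubes]

# The cubes cover the search space only if no assignment falsifies all of
# them, i.e. if the negated cubes, read as clauses, are unsatisfiable.
uncovered = satisfiable([[-lit for lit in cube] for cube in cubes], 3)

print("sub-problems satisfiable:", sub_results)
print("instance satisfiable:    ", satisfiable(instance, 3))
print("partition leaves a gap:  ", uncovered)
```

Both sub-problems are unsatisfiable although the instance itself is satisfiable, because the assignment with l1 false and l2 true falls outside both cubes. Checking each sub-problem refutation separately cannot detect this, which is the gap a single stitched refutation closes.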


The rest of this paper is organized as follows. Section II
discusses background and related work. Section III presents the 
Proof-Stitch algorithm and theoretically justifies our method 
of combining refutations. We also describe an optimization 
technique that reduces the checking time and the size of the 
combined refutations. Section IV details our tool implemen- 
tation. Results are presented in Section V, and Section VI 
concludes. 


This article is licensed under a Creative 
II. BACKGROUND AND RELATED WORK
A. Propositional refutations 


We assume familiarity with the basic concepts of CDCL 
SAT algorithms (see, e.g., [14]). We also assume that a base 
SAT solver can produce a DRAT refutation, which we define 
below (following [15]). 

Throughout the paper we model clauses as sets of literals 
and formulas as multisets of clauses. By $\cdot \cup \cdot$, we denote the
standard union operation on sets, and the multiplicity-summing 
union on multisets. 

Let $F = \{C_1, \ldots, C_n\}$ be a formula. $F$ unit propagates on $\ell$
to $F' = \{C \setminus \{\neg\ell\} : C \in F, \ell \notin C\} \cup \{\{\ell\}\}$ (written $F \to_\ell F'$) if
there exists a clause $\{\ell, \ell_1, \ldots, \ell_k\} \in F$ such that $\{\neg\ell_i\} \in F$
for $i \in [1, k]$. If $F \to_\ell F'$ for some $\ell$, then $F \to F'$. We say
that $F \to \bot$ if $F$ contains an empty clause. Let the relation
$\to^*$ denote the reflexive, transitive closure of $\to$. We say that
$F \mapsto F'$ when $F \to^* F'$ and there is no $F'' \neq F'$ such that
$F' \to F''$. One can show that the $\mapsto$ relation is a function.
We say that $C = \{\ell_1, \ldots, \ell_k\}$ has asymmetric tautology (AT)
with respect to $F$ if $F \cup \{\{\neg\ell_1\}\} \cup \cdots \cup \{\{\neg\ell_k\}\} \mapsto \bot$. We say
that $C$ has resolution asymmetric tautology (RAT) with respect
to literal $\ell_1 \in C$ and $F$ if for all $C' \in F$ containing $\neg\ell_1$,
$C \cup (C' \setminus \{\neg\ell_1\})$ has AT.
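
The propagation-based definitions above translate fairly directly into code. The following is a minimal sketch under stated assumptions, not the checker used in the paper: literals are signed integers, a formula is a list of clauses, and the function names are illustrative.

```python
def propagates_to_bottom(formula):
    """Run unit propagation to a fixed point; True if an empty clause appears."""
    clauses = [frozenset(c) for c in formula]
    done = set()  # literals that have already been propagated
    while True:
        if frozenset() in clauses:
            return True
        unit = next((next(iter(c)) for c in clauses
                     if len(c) == 1 and next(iter(c)) not in done), None)
        if unit is None:
            return False
        done.add(unit)
        # Drop clauses containing the literal, strip its negation from the
        # others, and keep the unit clause itself.
        clauses = [c - {-unit} for c in clauses if unit not in c]
        clauses.append(frozenset({unit}))

def has_at(clause, formula):
    """AT: the formula plus the negation of every literal propagates to bottom."""
    return propagates_to_bottom(list(formula) + [{-lit} for lit in clause])

def has_rat(clause, lit, formula):
    """RAT on lit: every resolvent with a clause containing -lit has AT."""
    return all(has_at(set(clause) | (set(c) - {-lit}), formula)
               for c in formula if -lit in c)

F = [{1, 2}, {-1, 2}, {1, -2}]
print(has_at({2}, F))             # {2} follows from F by unit propagation
print(has_at({1}, [{1, 2}]))      # {1} does not follow from (1 or 2)
print(has_rat({1}, 1, [{1, 2}]))  # no clause contains -1, so RAT holds vacuously
```

The last call illustrates why RAT is weaker than AT: a clause with no resolution partners is RAT even when it is not implied by the formula.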

Let $o_i$ denote an operation. Consider a sequence of operation-clause
pairs $\pi = ((o_1, C_1), \ldots, (o_m, C_m))$, where each $o_i$
indicates either the addition ($\oplus$) or deletion ($\ominus$) of a clause
from a formula.

Let $\phi$ denote a CNF formula. Define $\phi_i$ recursively: $\phi_0 = \phi$,
and $\phi_{i+1}$ is $\phi_i \cup \{C_{i+1}\}$ when $o_{i+1}$ is $\oplus$, or $\phi_i \setminus \{C_{i+1}\}$
otherwise. The sequence $\pi$ is a DRAT refutation of $\phi$ if
$C_{i+1}$ has RAT with respect to $\phi_i$ whenever $o_{i+1} = \oplus$, and if the
last element in $\pi$ is $(\oplus, \emptyset)$.


B. Divide-and-Conquer SAT solving 


One parallel SAT solving paradigm is Divide-and-Conquer: 
a SAT instance is divided into simpler SAT instances (sub- 
problems), which are then solved in parallel. Typically, the 
sub-problems represent partitions of the search space, such 
that the disjunction of all the sub-problems is equisatisfiable 
with the original problem. The sub-problems are derived 
from the original instance by assigning Boolean values to 
literals. The set of literals that are assigned (decided) for a 
particular sub-problem is called the cube of the sub-problem 
and the number of literals in the cube is the depth of the sub- 
problem. There are many D&C-based solvers [4]-[6], including: 
Psato [16], Painless [17], and AMPHAROS [18]. One 
prominent D&C approach, Cube-and-Conquer [19], uses a 
lookahead solver to divide instances and a CDCL solver to 
solve sub-problems. This approach has been successful for 
large mathematical problems [20] and is implemented by tools 
such as Paracooba [21] and gg-sat [8]. 

D&C SAT solvers generate separate DRAT refutations for 
each sub-problem. There has been little work on combining 
these refutations into a single refutation for the original instance. 
One work [22] considers proof composition, but its parallel 
composition rule does not apply to DRAT refutations. Another
work [23] gives an alternate proof calculus for parallel solvers. 


III. METHODOLOGY


In this section, we present an algorithm to combine sub- 
problem refutations into a refutation for the original Boolean 
instance. Then we show the algorithm’s correctness. Finally, 
we present a technique to optimize the combined refutations. 


A. Algorithm 


The first step in the Proof-Stitch algorithm is to construct a 
decision tree representing the steps taken by the D&C solver. 
The root of the tree represents the original instance, and the 
leaves represent the sub-problems. Figure 1 shows the decision 
tree for an example instance. 


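Algorithm 1: Stitching algorithm
In : Instance: $\phi$,
     Decision literal: $x$,
     Refutations of:
       $\phi \cup \{\{x\}\}$: $\pi = ((o_1, C_1), \ldots, (o_n, C_n))$,
       $\phi \cup \{\{\neg x\}\}$: $\pi' = ((o'_1, C'_1), \ldots, (o'_m, C'_m))$
Out : Refutation of $\phi$
procedure stitching($\phi$, $x$, $\pi$, $\pi'$)
  return
    ( $(o_1, C_1 \cup \{\neg x\}), \ldots, (o_n, C_n \cup \{\neg x\})$,
      $(o'_1, C'_1 \cup \{x\}), \ldots, (o'_m, C'_m \cup \{x\})$, $(\oplus, \emptyset)$ )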


Next, Proof-Stitch performs a sequence of stitching operations
to produce a single refutation for the original SAT
instance. A stitching operation (Algorithm 1) reads in a SAT
instance $\phi$, a decision variable $x$, and two refutations $\pi$ and $\pi'$
corresponding to the sub-problems $\phi \cup \{\{x\}\}$ and $\phi \cup \{\{\neg x\}\}$,
respectively. It produces a single refutation corresponding to the
instance $\phi$. The refutation for instance $\phi$ contains the clauses
from refutation $\pi$ appended with the literal $\neg x$ and the clauses
from refutation $\pi'$ appended with the literal $x$. More generally,
the clauses from a refutation are appended with the negation
of the decision literal used to generate the sub-problem. Figure
2 illustrates the stitching operation.
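
A stitching operation of this kind is short enough to sketch directly. The code below is a minimal illustration and not the released Proof-Stitch tool: refutations are modeled as lists of (operation, clause) pairs with signed-integer literals, the names `ADD`, `DELETE`, and `to_drat_lines` are assumptions for the example, and the instance argument is omitted because the operation never inspects it.

```python
ADD, DELETE = "a", "d"

def stitching(x, proof_pos, proof_neg):
    """Combine refutations of phi + {x} and phi + {-x} into one for phi."""
    # Clauses from the refutation under x gain the literal -x ...
    stitched = [(op, clause | {-x}) for op, clause in proof_pos]
    # ... clauses from the refutation under -x gain the literal x ...
    stitched += [(op, clause | {x}) for op, clause in proof_neg]
    # ... and the final step adds the empty clause.
    stitched.append((ADD, frozenset()))
    return stitched

def to_drat_lines(proof):
    """Render in DRAT text form: deletions are prefixed with 'd', lines end in 0."""
    lines = []
    for op, clause in proof:
        lits = " ".join(str(lit) for lit in sorted(clause, key=abs))
        prefix = "d " if op == DELETE else ""
        lines.append(f"{prefix}{lits} 0".replace("  ", " ").strip())
    return lines

# phi = (1 or 2)(1 or -2)(-1 or 2)(-1 or -2); each half is refuted by unit
# propagation alone, so each sub-problem refutation is just the empty clause.
pi_pos = [(ADD, frozenset())]
pi_neg = [(ADD, frozenset())]

stitched = stitching(1, pi_pos, pi_neg)
print(to_drat_lines(stitched))
```

In the stitched result the two former empty clauses become the unit clauses on `-1` and `1`, from which the final empty clause follows by unit propagation.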

As an example of the proof combination process, consider
Figure 3. First the refutations $\pi_{00}$ and $\pi_{01}$ are combined.
Then $\pi_{10}$ and $\pi_{11}$ are combined, and finally, $\pi_0$ and $\pi_1$ are
combined to produce the refutation $\pi$ corresponding to the
original instance. In Proof-Stitch, the stitching operations are
ordered according to the following rule: A stitching operation to 
combine a pair of refutations 7 and 7’ can only occur after all 
refutations with greater depth have been combined. Informally, 
this means that refutations are combined in decreasing order 
of their depth, as shown in Figure 3. Stitching operations at 
the same depth are independent and can occur in parallel. 
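
The depth-ordering rule can be sketched as a loop that always stitches the deepest pending refutations first. This is an illustrative sketch, not the tool's scheduler: cubes are modeled as tuples of signed decision literals listed from the root down, the function names are assumptions, and the stitches at one depth run sequentially here even though they are independent.

```python
ADD = "a"

def stitching(x, proof_pos, proof_neg):
    """Stitch refutations for decision x (positive branch) and -x (negative branch)."""
    stitched = [(op, clause | {-x}) for op, clause in proof_pos]
    stitched += [(op, clause | {x}) for op, clause in proof_neg]
    stitched.append((ADD, frozenset()))
    return stitched

def stitch_all(leaves):
    """leaves maps each cube to its refutation; returns the root refutation."""
    pending = dict(leaves)
    while () not in pending:
        depth = max(len(cube) for cube in pending)
        for cube in [c for c in pending if len(c) == depth]:
            if cube not in pending:
                continue  # already consumed as the sibling of an earlier cube
            parent, x = cube[:-1], abs(cube[-1])
            pos = pending.pop(parent + (x,))
            neg = pending.pop(parent + (-x,))  # KeyError means a sibling is missing
            pending[parent] = stitching(x, pos, neg)
    return pending[()]

# Unbalanced decision tree for phi = (1 or 2)(1 or -2)(-1 or 2)(-1 or -2):
# the branch with literal 1 is split again on 2, the branch with -1 is a leaf.
empty = [(ADD, frozenset())]
leaves = {(1, 2): empty, (1, -2): empty, (-1,): empty}

proof = stitch_all(leaves)
for op, clause in proof:
    print(op, sorted(clause))
```

A missing sibling surfaces as a `KeyError`, which is one way an incomplete partition or a lost sub-problem refutation would show up during stitching.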


[Figure: decision tree with leaf refutations $\pi_{00}$, $\pi_{01}$, $\pi_{10}$, and $\pi_{11}$]


Fig. 1: Decision tree of an example unsatisfiable SAT instance.


[Figure: example refutations before and after stitching on decision literal $\ell_7$]


Fig. 2: Stitching operation on example refutations


B. Justification for the stitching operation 


We now show that Algorithm 1 is correct: given suitable
inputs, it produces a DRAT refutation for $\phi$.


Definition 1. A DRAT refutation $\pi$ is preserving if for all $C$,
$(\ominus, C)$ occurs at most as many times in $\pi$ as $(\oplus, C)$.


Lemma 1. Let $\phi$ be a CNF formula, $x$ be a variable, and
$\pi$ and $\pi'$ be preserving DRAT refutations of $\phi \cup \{\{x\}\}$ and
$\phi \cup \{\{\neg x\}\}$ respectively. Then, stitching($\phi$, $x$, $\pi$, $\pi'$) outputs a
preserving DRAT refutation of $\phi$.


Proof. Let $\pi^*$ be the output of stitching. Let $\pi =
((o_1, C_1), \ldots, (o_n, C_n))$ and $\pi' = ((o'_1, C'_1), \ldots, (o'_m, C'_m))$.
Let $\psi = \phi \cup \{\{x\}\}$ and $\psi' = \phi \cup \{\{\neg x\}\}$. Define $\psi_i$ recursively,
by $\psi_0 = \psi$ and $\psi_{i+1} = \psi_i \cup \{C_{i+1}\}$ when $o_{i+1}$ is an addition,
and $\psi_{i+1} = \psi_i \setminus \{C_{i+1}\}$ otherwise. Define $\psi'_i$ (respectively
$\phi_i$) analogously, based on formula $\psi'$ (resp. $\phi$) and refutation
$\pi'$ (resp. $\pi^*$).

By construction, $\pi^*$’s final step is $(\oplus, \emptyset)$. Moreover, since
$\pi$ and $\pi'$ are preserving and formulas are clause multisets,
$\pi^*$ is preserving. Thus, our main task is to show that each
addition $(\oplus, C^*_{i+1})$ in $\pi^*$ has RAT with respect to $\phi_i$. $C^*_{i+1}$
is either derived from a clause in $\pi$, derived from a clause in
$\pi'$, or is the final empty clause. We begin with the first case:
$C^*_{i+1} = C_{i+1} \cup \{\neg x\}$.

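First, we show that if $C_{i+1} = \{\ell_1, \ldots, \ell_k\}$ has AT with respect to
$\psi_i$, then $C^*_{i+1}$ has AT with respect to $\phi_i$. Note that $\psi_i \cup
\{\{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\} = F \cup \{\{x\}\} \cup \{\{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\} \to_x
F' \cup \{\{x\}\} \cup \{\{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\} \mapsto \bot$. Now, consider
$F'' = \phi_i \cup \{\{x\}, \{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\}$. If $F'' \mapsto \bot$, then $C^*_{i+1}$
has the desired property. Observe that $F'' \to_x F' \cup \{\{x\}\} \cup
\{\{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\}$; thus, since the latter propagates to bottom,
$F''$ does too.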

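Second, we show that if $C_{i+1}$ has RAT with respect to literal
$\ell$ and formula $\psi_i$, then $C^*_{i+1} = \{\neg x\} \cup C_{i+1}$ has RAT with
respect to literal $\ell$ and formula $\phi_i$. Let $C^*$ be a clause in $\phi_i$ that
contains $\neg\ell$. If $C^*_{i+1} \cup (C^* \setminus \{\neg\ell\})$ has AT with respect to $\phi_i$, we
are done. Since $C^*$ is a clause in $\phi_i$, there is some $C$ in $\psi_i$ such
that $C \cup \{\neg x\} = C^*$ or $C = C^*$. Thus, $C^*_{i+1} \cup (C^* \setminus \{\neg\ell\}) =
\{\neg x\} \cup C_{i+1} \cup (C \setminus \{\neg\ell\})$. Let $\neg x, \ell_1, \ldots, \ell_k$ be the literals
of this clause. As before, since $\psi_i \cup \{\{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\}$ unit
propagates to bottom, $\phi_i \cup \{\{x\}, \{\neg\ell_1\}, \ldots, \{\neg\ell_k\}\}$ does too.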

In the case that $C^*_{i+1} = C'_{i+1} \cup \{x\}$ (i.e., $C^*_{i+1}$ is derived
from $\pi'$), the argument is similar. The key insight is that
an initial propagation on $\neg x$ in any AT check removes all
the clauses added by $\pi$. Since $\pi$ deletes no clauses from the
original formula, this leaves an intermediate propagation result
that shows $C^*_{i+1}$ is RAT.

The final step in $\pi^*$ is $(\oplus, \emptyset)$. It has AT because $\phi_{n+m}$
contains both $\{x\}$ and $\{\neg x\}$. Since $\pi^*$’s added clauses all
have the AT or RAT properties, and the final step adds an
empty clause, $\pi^*$ is a valid DRAT refutation of $\phi$.


In Proof-Stitch, the final refutation is built through stitching 
operations on DRAT refutations of the sub-problems. Since 
each stitching operation produces a preserving DRAT refutation, 
recursive application of Lemma 1 proves that the final refutation 
is a valid DRAT refutation of the original instance. 


C. Optimization 


Empirically, we have observed that refutations created 
through stitching operations contain a large number of clauses 
that are not needed during validation ("redundant" clauses). 
Identifying and removing these clauses reduces the time 
required to check the refutation and the storage space required 
to save the refutation. One approach to remove such redundant 
clauses is by identifying the "unsatisfiable core" as described 
in [24]. This approach optimizes the refutation by only retaining 
clauses that are essential for validation by a proof-checker. Our 
implementation optimizes refutations by using drat-trim to 
extract the unsatisfiable core after every stitching operation. 

However, aggressively invoking the optimization technique 
(e.g., after every stitching operation) could incur significant run- 
time overhead in the refutation generation process. This calls for 
a heuristic to decide when to apply the optimization technique. 
Empirically we observe that refutations with larger clauses 
(more literals) require longer to check. We hypothesize that this 
occurs because larger clauses are less likely to contribute to unit- 
propagation while simultaneously consuming more memory 
in the cache of the refutation checker. Therefore, optimizing 
refutations with large clauses should yield the greatest benefit. 
To implement this, we introduce a threshold parameter CLavg.
After each stitching step, the refutation is optimized only if the
average clause length in the refutation is greater than CLavg.
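
The threshold test itself is a one-line comparison. The sketch below is illustrative rather than the tool's code: it assumes the average is taken over every clause in the refutation (additions and deletions alike) and treats any negative threshold as "never optimize", which matches the documented effect of setting the parameter to -1 but is otherwise an assumption.

```python
def average_clause_length(refutation):
    """Mean number of literals per clause over all steps of the refutation."""
    clauses = [clause for _, clause in refutation]
    if not clauses:
        return 0.0
    return sum(len(clause) for clause in clauses) / len(clauses)

def should_optimize(refutation, cl_avg):
    """Decide whether to run the trimming tool after a stitching step."""
    if cl_avg < 0:
        # Assumed sentinel handling: a negative threshold disables optimization.
        return False
    return average_clause_length(refutation) > cl_avg

refutation = [
    ("a", frozenset({1, 2, 3})),
    ("a", frozenset({1})),
    ("a", frozenset()),
]
print(average_clause_length(refutation))
print(should_optimize(refutation, 0))    # threshold 0: optimize
print(should_optimize(refutation, 10))   # clauses too short: skip
print(should_optimize(refutation, -1))   # optimization disabled
```

A higher threshold trades checking time for stitching time: fewer intermediate refutations are trimmed, so stitching runs faster but the final refutation carries more redundant clauses.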


Fig. 3: Refutation stitching process for the SAT instance shown in Figure 1. The decision literals are omitted.


IV. IMPLEMENTATION 


In this section, we describe our implementation of the Proof- 
Stitch algorithm. Proof-Stitch is implemented in Python and 
uses drat-trim [12] to optimize refutations. Our tool comprises 
just under 300 lines of Python code and is available on
GitHub [11]. 

The tool inputs are the original SAT instance in CNF form, 
the refutations and cubes for each sub-problem, and the thresh- 
old value CLavg. Our implementation requires that the cube of 
each sub-problem be encoded in the name of the corresponding 
refutation file. For example, the refutation file corresponding to 
refutation 79 in Figure 1 is named ¢,_€5.proof. The output is 
a single file containing a refutation of the original instance. As 
noted in Section III, stitching operations at the same depth of
the decision tree are independent and their combined refutations
can be optimized in parallel. Our tool supports this. Setting
the parameter CLavg = 0 enables optimization after every
stitching operation and CLavg = -1 turns off optimization
(only stitching is performed). We denote refutations combined
with CLavg = 0 as "fully optimized" and refutations combined
with CLavg = -1 as "unoptimized".


V. EXPERIMENTS 


To evaluate Proof-Stitch, we run it on six benchmarks 
from the parallel track of last year’s SAT competition [13]. 
The chosen benchmarks can be solved by Paracooba [21] 
within 1 minute of run-time. We also attempted running 
the tool on harder instances from the parallel track. While 
unoptimized proofs can be produced quickly (within a few 
minutes) on those instances, proof-checking and optimization 
are both computationally prohibitive due to the limitation of 
the underlying proof-checker (e.g., drat-trim fails to validate 
the combined refutations on harder instances even with a 
24 hour time limit). For large refutations, the proof-checker 
faces memory and run-time bottlenecks on almost all the 
intermediate optimization steps. Therefore, we do not consider 
harder instances in our evaluation, but note that the proposed 
techniques in principle apply to larger instances once the 
scalability of the underlying proof-checker improves. 

In our experiments, we compare the checking time and size 
of unoptimized refutations against fully optimized refutations 
to show the benefit of optimization. We also report the tool 
run-time to demonstrate that Proof-Stitch does not introduce 
unacceptable overheads. Finally, we analyze the average 
checking time and tool run-time for CLavg = 10, a value
TABLE 1: Refutation checking time (s), tool run-time (s),
and size of refutation file (MB) for six benchmarks
from last year’s SAT competition [13]


                          Unoptimized                         Fully Optimized
Benchmark                 Check (s)   Tool (s)   Size (MB)    Check (s)   Tool (s)   Size (MB)
p01_lb_05                       987        271        1700          141        686         184
ktf_TF-4.tf_2_0.02_18           212         78         385           76        600          71
satch2ways12u                  1370        275        1600          272        836         655
pb_300_10_lb_06                 163        107         536           36        459          27
mp1-Nb6T06                      241        106         586           44        201         222
E02F17                          417        223        1500          112        467         294

empirically determined to perform well. We perform our 
evaluation on an Intel Xeon E5-2640 v3 machine with 128 
GBytes of DRAM and 16 cores. 

Table 1 shows the time required for drat-trim to check
the final refutations for the benchmarks, the tool execution
time to combine refutations, and the size of the combined
refutations. With full optimization, the time required to check
refutations drops by a factor of 2.7 to 7 across the benchmarks.
Full optimization also results in
smaller refutation file sizes, but increases the tool run-time.
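
The reported speedup range can be recomputed from the checking times in Table 1. The snippet below is a small sketch using only those tabulated values, listed in row order; the variable names are illustrative.

```python
# (unoptimized, fully optimized) checking times in seconds, in Table 1 row order.
check_times = [
    (987, 141),
    (212, 76),
    (1370, 272),
    (163, 36),
    (241, 44),
    (417, 112),
]

speedups = [unopt / opt for unopt, opt in check_times]

for (unopt, opt), factor in zip(check_times, speedups):
    print(f"{unopt:5d} s -> {opt:4d} s  ({factor:.2f}x)")
print(f"range: {min(speedups):.2f}x to {max(speedups):.2f}x")
```

The smallest ratio is just under 2.8 and the largest is exactly 7, consistent with the range quoted in the text once the lower bound is truncated rather than rounded.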

Figure 4 compares the average run-time to combine refuta- 
tions (denoted “merging” time) and the average run-time to 
check refutations for unoptimized, CLavg = 10, and fully
optimized refutations. Interestingly, running our tool with 
CLavg = 10 decreases the total validation time (merging + 
checking) compared to the unoptimized case. This points to 
the benefit of optimizing refutations in parallel—the overhead 
associated with optimizing refutations can be amortized by 
the savings in refutation checking time. Another important 
observation is that setting CLavg = 10 reduces the time 
required to combine refutations compared to the unoptimized 
case. We believe the reason is as follows: optimizing refutations 
decreases their size. When CLavg = 10, we optimize all
intermediate refutations with average clause length greater
than 10. Since the intermediate refutations are now smaller,
the next stitching operation on this refutation takes less time.
The time spent in optimizing refutations is mitigated by the 
savings in stitching time. 


VI. CONCLUSION 


We have presented Proof-Stitch, a technique that comple- 
ments Divide-and-Conquer SAT solvers by combining sub- 
problem refutations into a single refutation for the original 
Fig. 4: Average merging time and refutation checking time
when the refutations are not optimized, optimized with 
CLavg = 10 and fully optimized 


instance. Proof-Stitch also uses existing proof-trimming tools 
to optimize the combined refutation. 

Future Work: Proof-Stitch’s run-time overhead can be 
reduced by performing more stitching operations in parallel. 
Currently, only stitching operations at the same tree depth are 
parallelized, while in principle, any two independent stitching 
operations could be parallelized. Another potential future 
direction would be to incorporate parallelism in the refutation 
checker itself, likely requiring extension of the DRAT format to 
incorporate structural information of the search tree. Finally, it 
would be interesting to evaluate alternative measures for guiding 
the optimization process, such as Literal Block Distance [25], 
and to look into additional ways to reduce refutation sizes. 

Acknowledgement: This work began as a course project for 
Caroline Trippel’s CS357S (Fall 2021) at Stanford University. 

Abstract—In software development, verified compilers like 
the CompCert compiler and the CakeML compiler enable a 
methodology for software development and verification that allows 
software developers to establish program-correctness properties on 
the verified compiler’s target level. Inspired by verified compilers 
for software development, the verified Verilog synthesis tool 
Lutsig enables the same methodology for Verilog hardware 
development. In this paper, we address how Verilog features 
that must be understood as hardware constructs, rather than as 
software constructs, fit into hardware development methodologies, 
such as Lutsig’s, that are inspired by the development methodology 
enabled by software compilers. We explore this issue by extending 
the subset of Verilog supported by Lutsig with one such 
feature: always_comb blocks. In extending Lutsig’s Verilog 
support with this, seemingly minor, feature, we are, perhaps 
surprisingly, required to revisit Lutsig’s methodology for circuit 
development and verification; this revisit, it turns out, requires 
reconciling traditional Verilog development and the traditional 
program-verification methodology offered by verified software 
compilers. All development for this paper has been carried out 
in the HOL4 theorem prover. 

Index Terms—hardware development, hardware synthesis, 
Verilog 


I. INTRODUCTION 


In software development, verified compilers enable the 
following interactive-theorem-proving-based verified-program 
development (VPD) methodology: 


1) develop and compile your program in the same way as 
when using an unverified compiler; 

2) prove a source-level correctness theorem about your 
program (by whatever means you have available — the 
methodology is independent of how the correctness 
theorem is established); and, lastly, 

3) transport the source-level program-correctness theorem 
down to your verified compiler’s target level by simple 
composition of the source-level program-correctness 
theorem and the compiler’s (program-independent) cor- 
rectness theorem. 


VPD has been successfully deployed in many different 
software contexts, such as imperative programming [1], 
functional programming [2], concurrent programming [3], 
just-in-time compilation [4], [5], compiler-implementation 
correctness (by compiler bootstrapping) [2], [6], usability 
such as compositional/separate compilation [7], security such 
as constant-time preservation [8], and performance such as 
time/space reasoning [9]-[11]. 

In this paper, however, our interest lies in hardware development 
rather than software development. Previous work on 
verified hardware-synthesis tools [12]-[15], also known as 
hardware compilers, shows that VPD is equally applicable 
to hardware contexts, thereby providing a methodology for 
circuit development and verification. In this paper, we augment 
existing work on VPD in hardware contexts by considering 
features of the source language Verilog that must be understood 
as hardware constructs rather than as software constructs. 


To handle such hardware constructs, we propose a hardware 
development methodology combining VPD and traditional 
Verilog development (TVD). While radical methodological 
redesign is certainly a worthwhile enterprise [16]-[26], we 
here dedicate our energy towards an enterprise in which we 
want to maintain as much as possible of the look-and-feel of 
both VPD and TVD. Specifically, as we further elaborate in 
the next section (Sec. II), we want to maintain both (1) VPD’s 
ability to transport source-level correctness theorems down to 
the compiler’s target level and (2) TVD’s synthesis-modeling- 
idiom-based approach to synthesis. 


We validate the proposed methodology combining VPD and 
TVD by adapting and extending Lutsig [14], a verified synthesis 
tool for synchronous Verilog designs, for the methodology. 
Specifically, we extend Lutsig’s Verilog support with one of 
Verilog’s features that must be understood as a hardware construct: 
always_comb blocks, which allow hardware designers 
to declare that certain parts of their behavioral Verilog code 
are to be synthesized to combinational logic. Combinational 
logic is stateless logic and stands in contrast to sequential logic 
(modeled as e.g. always_ff blocks), which is stateful logic. 


All in all, we make the following two contributions: 


• We propose a development methodology combining VPD, 
i.e. the traditional development methodology based on 
verified compilers, and TVD, i.e. traditional Verilog 
development, in a way that inherits the strengths of both 
and simultaneously avoids their main weaknesses. 

• We validate the methodology by showing that it allows 
us to add support for always_comb blocks to Lutsig, the 
Verilog semantics used in Lutsig, and a proof-producing 
Verilog code generator connected to Lutsig. 


All the work for this paper has been carried out in the HOL4 
theorem prover [27]. All source code and proofs are available 
at https://github.com/CakeML/hardware. 


This article is licensed under a Creative Commons Attribution 4.0 
International License. 


II. BACKGROUND: VPD AND TVD 


This section serves two purposes: firstly, it introduces VPD 
and TVD in more detail, and, secondly, it establishes notation 
and terminology used in the rest of the paper. 


A. Verified-program development (VPD) 


We now give a more detailed description of VPD, following 
the exposition of Leroy [1]. In VPD, we start off with a 
source program $P_S$ implemented in a source language $S$ and 
a compiled program $P_T$ implemented in a target language $T$ 
produced by a compiler: $\mathit{Comp}\ P_S = \mathit{OK}\ P_T$. If the compiler is 
unable, for whatever reason, to compile $P_S$, then a compile-time 
error is reported: $\mathit{Comp}\ P_S = \mathit{Error}$. The source language $S$ 
has a semantics $L_S$, and the target language $T$ has a semantics 
$L_T$. The two semantics $L_S$ and $L_T$ associate sets of observable 
behaviors $B$ to source and target programs. We write $P \Downarrow_L B$ 
to denote that a program $P$ executes with observable behavior 
$B$ under semantics $L$. 

We say that a compiler $\mathit{Comp}$ is verified when we have 
proved $\forall P_S\, P_T,\ \mathit{Comp}\ P_S = \mathit{OK}\ P_T \implies P_S \approx P_T$ for 
some notion of semantic preservation $\approx$. The only notion 
of semantic preservation we use in this paper is backward 
simulation: $P_S \approx P_T \iff \forall B,\ P_T \Downarrow_{L_T} B \implies P_S \Downarrow_{L_S} B$; 
that is, any behavior of the target program must be a behavior 
allowed by the source semantics. 

Compiler users, however, are not ultimately interested in the 
correctness of the compiler $\mathit{Comp}$ they are using; rather, when 
compiling a source program $P_S$ with a compiler, users are 
ultimately interested in the correctness of the target program $P_T$ 
produced by the compiler. This is, of course, also part of VPD. 
Since it is easier to prove the correctness of $P_S$ and transport 
the result to $P_T$ than it is to prove the correctness of $P_T$ directly, 
VPD proceeds as follows. Following Leroy’s exposition, users are 
asked to formalize what they mean by their program being 
correct by providing a predicate $\mathit{Spec}$ over observable behaviors. 
We write $P \models_L \mathit{Spec}$ for $\forall B,\ P \Downarrow_L B \implies \mathit{Spec}\ B$. Now, 
for a successful compiler run $\mathit{Comp}\ P_S = \mathit{OK}\ P_T$, if the user’s 
compiler $\mathit{Comp}$ has been verified (with backward simulation 
as the notion of semantic preservation), then the user can derive 
$P_T \models_{L_T} \mathit{Spec}$ (i.e., what the user is ultimately interested in) 
from $P_S \models_{L_S} \mathit{Spec}$ by simple composition. 
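
The composition step is short enough to check mechanically. The following Lean 4 sketch is an illustrative abstraction only, not the HOL4 development: the names BackwardSim, Sat, and transport are hypothetical, and programs, behaviors, and the two semantics are modeled as arbitrary types and relations.

```lean
-- Hypothetical abstract model: Src and Tgt are program types, Beh is the
-- type of observable behaviors, and each semantics is a relation
-- between programs and behaviors.
variable {Src Tgt Beh : Type}

/-- Backward simulation: every target behavior is also a source behavior. -/
def BackwardSim (execS : Src → Beh → Prop) (execT : Tgt → Beh → Prop)
    (ps : Src) (pt : Tgt) : Prop :=
  ∀ b, execT pt b → execS ps b

/-- A program satisfies a specification when all of its behaviors do. -/
def Sat {P : Type} (exec : P → Beh → Prop) (p : P) (spec : Beh → Prop) : Prop :=
  ∀ b, exec p b → spec b

/-- Theorem transportation: a source-level result carries over to the target. -/
theorem transport (execS : Src → Beh → Prop) (execT : Tgt → Beh → Prop)
    (ps : Src) (pt : Tgt) (spec : Beh → Prop)
    (hsim : BackwardSim execS execT ps pt)
    (hsrc : Sat execS ps spec) : Sat execT pt spec :=
  fun b hb => hsrc b (hsim b hb)
```

In this abstraction, the hypothesis hsim is what a verified compiler provides for every successful compiler run, and hsrc is the user's source-level correctness theorem.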


B. Traditional Verilog development (TVD) 


We now turn to TVD. As Weste and Harris [28, p. 699] 
put it, hardware description languages (HDLs) like Verilog are 
“better understood as shorthand for describing digital hardware” 
than programming languages. Continuing, Weste and Harris 
describe TVD as follows: 

1) “[...] begin your design process by planning, on paper 
or in your mind, the hardware you want.” 
2) “Then, write the HDL code that implies that hardware 
to a synthesis tool.” 


In TVD, an important concept is modeling idioms, which 
enable the hardware designer to express not only the behavior 
of their design but also what kind of hardware they want. Modeling 
idioms are what allow the hardware designer to write Verilog 
code that “implies” the hardware design the hardware designer 
has formed “on paper or in [their] mind.” 

Examples of modeling idioms include always_ff and 
always_comb blocks, allowing hardware designers to specify 
if sequential or combinational logic should be inferred by the 
synthesis tool. In general, what modeling idioms are available 
depends on what technology is targeted. E.g., the synthesis 
manual for Xilinx’s (unverified) synthesis suite Vivado [29, 
p. 111] contains modeling idioms and guidelines for modeling 
block RAMs (BRAMs), a type of memory available in Xilinx 
FPGAs. The modeling idioms related to BRAMs are presented 
as Verilog design fragments, instructing the hardware designer 
how to write their Verilog code such that the synthesis tool will 
infer features such as write enable inputs, byte-write-enable 
inputs, optional output registers, etc. 


III. RECONCILING VPD AND TVD 


Having introduced both VPD and TVD, we are now in 
a position to combine the best of both worlds: we want 
the methodology for circuit development and verification 
offered by Lutsig to provide the strengths of both VPD, i.e., 
theorem transportation, and TVD, i.e., synthesis-tool control 
by modeling idioms. 

As a first step, as we want to apply the VPD methodology to 
Verilog hardware development, we must specialize $\mathit{Comp}$, $S$, 
$L_S$, $T$, and $L_T$ to appropriate hardware instances. Since we, in 
this paper, are working with Lutsig, we set: $\mathit{Comp}$ = Lutsig, 
$S$ = Verilog (abbreviated “ver”), and $T$ = technology-mapped 
netlists for (a class of) FPGAs (abbreviated “nl”). For $L_T$, 
Lutsig uses a simple netlist language. What remains to specify 
is $L_S$, and this is where our problems begin. 

The problems surrounding $L_S$ arise from the fact that, 
traditionally conceived, Verilog has two semantics: one simulation 
semantics and one synthesis semantics. The reason for 
having two semantics, we will see, is TVD. This, however, 
does not fit cleanly into VPD since in VPD the source 
language $S$ is supposed to have one and only one semantics 
$L_S$; otherwise theorem transportation cannot be carried 
out by simple composition. 

We now discuss the two semantics in the context of synthesis 
tool design and how they relate and fit into VPD and TVD. 
We first introduce the two semantics, then survey the state 
of the art, and conclude by stating how our development 
methodology, combining VPD and TVD, as implemented 
in Lutsig contributes to the state of the art. 

Simulation semantics. The simulation semantics is given 
by the (System)Verilog standard [30]. The semantics is large, 
complicated, and full of gotchas [31], but is, at the end of the day, 
an informally specified event-based operational semantics. 

Synthesis semantics. The situation for the synthesis seman- 
tics is less straightforward. 

Firstly, one minor hurdle to overcome is that the authoritative 
source for the semantics is unclear. Since the Verilog standard 
does not provide a synthesis semantics and the Verilog synthesis 
standard [32] has been withdrawn, it is up to each synthesis tool 
to provide its own synthesis semantics. Current tool-specific 
synthesis manuals, such as the synthesis manuals for 
Vivado [29] and Quartus [33], however, largely contain material 
similar to that of the withdrawn synthesis standard (similar modeling 
idioms, design and coding-style recommendations, etc.), except 
specified in a more detailed fashion since such manuals are 
both tool- and target-technology-specific. We therefore use 
the withdrawn Verilog synthesis standard as the basis for our 
discussion here. 

Secondly, and this is the major hurdle, the synthesis semantics, both 
as specified in the synthesis standard and the tool-specific 
synthesis manuals, is not a full semantics like the simulation 
semantics; rather, it is just a collection of modeling idioms 
and design recommendations built on top of the simulation 
semantics. This ends up causing problems since some of the 
modeling idioms prescribe semantics incompatible with the 
simulation semantics: specifically, some of the modeling idioms 
have not only nonfunctional consequences but also functional 
consequences; in other words, some modeling idioms have 
consequences for the (functional) behavior of synthesized 
circuits! In TVD, the problems this causes are known as 
simulation-and-synthesis mismatches. Some mismatches are 
highlighted in (the informative) App. B in the synthesis 
standard. E.g., we are warned that the following module¹ 
will cause a simulation-and-synthesis mismatch since the 
assignments to y and tmp are “mis-ordered” (since the block 
is supposed to describe combinational logic, that is, stateless 
logic, and tmp is read before being assigned): 


module andor1b(output reg y, 
               input a, b, c); 
  reg tmp; 

  always @* begin 
    y = tmp | c; 
    tmp = a & b; 
  end 
endmodule 
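
The mismatch can be made concrete with a small model. The following Python sketch is illustrative only and is part of neither the synthesis standard nor Lutsig: simulate_block is a hypothetical, simplified model of a simulator executing the two blocking assignments in textual order (with tmp retaining its value between evaluations), and combinational is a hypothetical model of the stateless logic that the modeling idiom asks for.

```python
def simulate_block(a, b, c, tmp_old):
    """Run the block body in textual order, as a simulator would.

    tmp_old is the value tmp held before this evaluation (assumption of
    this sketch: tmp acts like state because it is read before written).
    Returns the pair (y, tmp_new).
    """
    y = tmp_old | c      # reads the stale value of tmp
    tmp = a & b          # tmp is only updated afterwards
    return y, tmp


def combinational(a, b, c):
    """Stateless logic implied by the idiom: y = (a & b) | c."""
    return (a & b) | c


def find_mismatches():
    """List every (a, b, c, tmp_old) on which the two readings disagree."""
    mismatches = []
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                for tmp_old in (0, 1):
                    y_sim, _ = simulate_block(a, b, c, tmp_old)
                    if y_sim != combinational(a, b, c):
                        mismatches.append((a, b, c, tmp_old))
    return mismatches
```

Under these assumptions, the two readings disagree exactly when c is 0 and the stale tmp differs from a & b, which is why a tool cannot satisfy both semantics for this module.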


State-of-the-art VPD. To some extent, VPD and TVD were 
reconciled already in the first version of Lutsig. However, 
except for X assignments, which, according to the synthesis 
standard, “tells the simulator to treat the signal as having 
an unknown value and tells the synthesis tool to treat the 
signal as a don’t care” [32, p. 106], not much attention was 
directed towards simulation-and-synthesis mismatches. This 
was because the supported subset of Verilog was sufficiently 
small and software-like that the parts of Verilog that risk causing 
simulation-and-synthesis mismatches were, in effect, avoided.² 

Now, on the other hand, when adding support for 
always_comb to Lutsig, i.e., a feature that must be understood 
as a hardware construct rather than as a software construct, and 
hence in terms of modeling idioms, 
further reconciliation between VPD and TVD is needed. At 
the same time, we should acknowledge that problems similar 
to our present problems can be found in software development 
as well. E.g., one aspect of what has happened is that we have 
ended up with nonfunctional expectations on our synthesis tool, 
and VPD, in its minimal incarnation, only covers functional 
expectations, specifically semantics preservation. Nonfunctional 
expectations are, of course, sometimes put on software compilers [35], 
since functional software-compiler guarantees say 
(most commonly) nothing about code size, memory usage, 
cache performance, overall performance, security, etc. Indeed, 
some of the software VPD work mentioned in the introduction 
provides examples of VPD work addressing nonfunctional 
properties, such as security [8] and space reasoning [9]. 

¹Here presented verbatim, using an always @* block rather than an 
always_comb block since the synthesis standard was published before the 
first SystemVerilog standard; the synthesis standard is based on the Verilog 
2001 standard [34]. 

²Clearly, a discussion concluding “Lutsig takes Verilog’s simulation semantics 
as its synthesis semantics” [14, p. 50] is insufficient for handling 
always_comb blocks. 

Another point of comparison is how so-called undefined 
behavior (UB) is handled in languages such as C [36], [37]. 
UB leaves some parts of the language in question with 
unspecified semantics (to allow for compiler optimizations). 
UB forms a subset of the language to avoid. Simulation-and- 
synthesis mismatches are similar to UB in the sense that sources 
of such mismatches can be seen as parts of Verilog to avoid. 
However, the two are not equivalent since the concept that 
induces simulation-and-synthesis mismatches, modeling idioms, 
has no analog in UB-based approaches to language semantics. 

Recall that we aim to keep the look-and-feel of TVD in Lutsig’s 
combination of VPD and TVD. We therefore must include 
modeling idioms in Lutsig’s synthesis methodology rather than 
try to formulate a synthesis story under a UB framework, even though 
such a framework is potentially more familiar to software developers. 

State-of-the-art TVD. Today’s commercial (unverified) 
synthesis tools leave much to be desired; within the same 
tool, simulation-and-synthesis mismatches are handled along 
the whole spectrum of: silently miscompiling Verilog designs, 
issuing warnings, and aborting the compilation process entirely. 
In consequence, the result of a successful synthesis run is 
unclear for hardware developers: since an error-free synthesis 
run does not guarantee an actually successful synthesis run, 
some form of postsynthesis inspection, e.g. testing or manual 
visual inspection, is needed to ensure that the functional and 
nonfunctional properties we are interested in survived or were 
established during synthesis. 

Lutsig’s methodology. The conclusion we draw from the 
above discussion is that, to handle both TVD and VPD, 
Lutsig must implement both of Verilog’s semantics: the simulation 
semantics for VPD-style theorem transportation, and the 
synthesis semantics, in the form of synthesis idioms, for 
synthesis-idiom-based TVD. 

In Lutsig, TVD is handled on an informal best-effort basis, 
since strict compliance prohibits too many optimizations, and 
VPD is handled, as it must, formally. 

An interesting question is how much of TVD can be 
handled formally. To illustrate that part of 
TVD can be treated formally, Lutsig’s treatment of the feature in 
focus in this paper, always_comb blocks, diverges from the above 
general pattern of treating TVD informally: we prove that if the 
two semantics assign different behaviors to an always_comb 
block (e.g., because of “mis-ordered” writes) in a given input 
design, then Lutsig will abort, since Lutsig cannot abide 
by both semantics if they point in different directions. It is 
Lutsig’s two top-level theorems (Sec. VIII and IX) that together 
formally show that Lutsig successfully handles both semantics 
for always_comb blocks. We leave the consideration of other 
synthesis idioms as future work. 

Lutsig’s contribution to establishing functional properties. 
As for the first version of Lutsig, we have proved that Lutsig 
is semantics preserving (Sec. VIII). Specifically, after our 
discussion, it should now be clear that Lutsig must be semantics 
preserving with respect to Verilog’s simulation semantics. We 
call Lutsig’s formalization of the simulation semantics $L_{ver}$; 
i.e., in terms of VPD, we have $L_S = L_{ver}$. The semantics 
is the same Verilog semantics as used in the first version of 
Lutsig, with the exception that we now have added support for 
always_comb blocks (as described in Sec. V). 

Since Lutsig allows for VPD development, after the hardware 
designer has transported a source-level correctness theorem 
down to the netlist level, the designer can rest assured that 
the synthesis process has not introduced any functional bugs. 
For functional correctness, VPD effectively forces Lutsig to 
adopt (in stark contrast to other Verilog synthesis tools) a 
uniform error handling mechanism: if Lutsig cannot guarantee 
semantics preservation, it must abort. As with the first version of 
Lutsig, and other verified compilers and synthesis tools, silent 
miscompilation is guaranteed to never occur. 

Lutsig’s contribution to establishing nonfunctional properties. 
We improve the state of the art in establishing nonfunctional 
hardware properties by proving that Lutsig’s synthesis 
algorithm correctly implements the modeling idiom that 
always_comb must generate combinational logic (Sec. IX), 
i.e., that Lutsig enables proven-correct TVD for always_comb blocks. 
For other modeling idioms, Lutsig does not improve the state 
of the art with respect to establishing nonfunctional properties. 

Other approaches to circuit correctness. The first Lutsig 
paper [14] compares VPD-style hardware development, as 
followed here, to other approaches to circuit correctness, such 
as translation validation (known as formal equivalence checking 
in the hardware world), so we do not repeat that discussion here. 


IV. USING LUTSIG IN PRACTICE 


The rest of the paper consists of putting the discussion up 
till now into practice by adding support for always_comb to 
Lutsig and surrounding components. But before heading into 
technical details, we show how all pieces of the development 
fit together by demonstrating how hardware designers can use 
Lutsig in combination with a proof-producing Verilog code 
generator, developed in conjunction with Lutsig, to transport 
correctness properties down to the netlist level.³ 


³We emphasize that what is demonstrated here is one of multiple potential 
use cases of Lutsig. Like any Verilog synthesis tool, Lutsig can be made 
part of different hardware-development flows. In particular, one can imagine 
many different front-ends capable of generating Lutsig Verilog ASTs and, in 
various ways, producing proofs of correctness for those ASTs. In this paper, 
the proof-producing code generator we use fits our purposes here. Someone 
wanting to verify and synthesize existing Verilog code will have other needs. 
For developers not interested in verification at all, there is a (unverified) 
Verilog-text-file front-end for Lutsig available such that Lutsig can be used 
like a conventional Verilog synthesis tool. 

module avg(input logic clk, 
           input logic[7:0] signal, 
           output logic[7:0] avg); 
  logic[7:0] h0 = 0, h1 = 0, h2 = 0, h3 = 0; 
  always_ff @(posedge clk) begin 
    h0 <= signal; h1 <= h0; h2 <= h1; h3 <= h2; 
  end 
  always_comb begin 
    avg = h0 + h1 + h2 + h3; 
    // Div by 4 by shifting 
    avg[0] = avg[2]; avg[1] = avg[3]; avg[2] = avg[4]; 
    avg[3] = avg[5]; avg[4] = avg[6]; avg[5] = avg[7]; 
    avg[6] = 0; avg[7] = 0; 
  end 
endmodule 


Fig. 1. Example Verilog module 


Example module. The Verilog module in Fig. 1, imple- 
menting a moving-average filter, serves as a running example 
in this section. The module utilizes Lutsig’s new support 
for always_comb blocks. Sec. V provides more details 
on Lutsig’s Verilog support. 

Proving Verilog designs correct. Lutsig is accompanied by 
a proof-producing Verilog code generator. The code generator is 
explained in more detail in Sec. VI. In short, the code generator 
constructs a Verilog module $P_{ver}$ given a HOL embedding 
$P_{HOL}$ of a Verilog circuit. As the code generator is proof-producing, 
the code generator enables hardware designers to 
transport properties proved about the input HOL circuit $P_{HOL}$, 
e.g. $P_{HOL} \models_{L_{HOL}} \mathit{Spec}$, to the generated Verilog module $P_{ver}$, 
i.e. $P_{ver} \models_{L_{ver}} \mathit{Spec}$, by simple composition. 

The Verilog module in Fig. 1 was in fact generated by 
the code generator from a HOL circuit. With the help of the 
code generator, we have proved that, if we by s[n] mean 
the value of signal s at clock cycle n, the generated Verilog 
module satisfies the specification (in 8-bit modular arithmetic) 

$$\mathit{avg}[n] = \frac{\sum_{i} \mathit{signal}[n - i]}{4},$$ 

with the sum taken over the four most recent samples (those held 
in h0 to h3); i.e., the module is correct. 
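
To make the behavior of the module in Fig. 1 tangible, the following Python sketch models it cycle by cycle. It is an informal illustration, not the HOL circuit or the proved specification: the function names comb_avg and run_avg are hypothetical, and the sketch assumes that one input sample is consumed per clock tick and that the always_comb block is evaluated after the always_ff block in each cycle.

```python
MASK = 0xFF  # all signals are 8 bits wide


def comb_avg(h0, h1, h2, h3):
    """Mirror the always_comb block of Fig. 1 bit by bit."""
    avg = (h0 + h1 + h2 + h3) & MASK           # 8-bit modular sum
    bits = [(avg >> i) & 1 for i in range(8)]  # bits[i] models avg[i]
    # Blocking assignments, executed in textual order as in Fig. 1.
    bits[0] = bits[2]
    bits[1] = bits[3]
    bits[2] = bits[4]
    bits[3] = bits[5]
    bits[4] = bits[6]
    bits[5] = bits[7]
    bits[6] = 0
    bits[7] = 0
    return sum(bit << i for i, bit in enumerate(bits))


def run_avg(samples):
    """Return the value of avg after each clock tick.

    Assumption of this sketch: one sample is consumed per tick, and the
    always_comb block runs after the always_ff block in the same cycle.
    """
    h0 = h1 = h2 = h3 = 0
    outputs = []
    for sample in samples:
        # Nonblocking assignments: every right-hand side uses old values.
        h0, h1, h2, h3 = sample & MASK, h0, h1, h2
        outputs.append(comb_avg(h0, h1, h2, h3))
    return outputs
```

For example, feeding the samples 4, 8, 12, 16 yields 10 after the fourth tick, i.e., (4 + 8 + 12 + 16) / 4. Because the sum is taken in 8-bit modular arithmetic, four samples of 200 yield 8 rather than 200.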

Going to the netlist level. Now having both a Verilog 
module (Fig. 1) and a correctness result for the module, 
we can synthesize a netlist implementation of the module, 
by invoking Lutsig, and transport the correctness result to 
the netlist implementation, by composing the Verilog-level 
correctness result with Lutsig’s correctness theorem (i.e., in 
general notation, derive $P_T \models_{L_T} \mathit{Spec}$ from $P_S \models_{L_S} \mathit{Spec}$). 
We discuss Lutsig in more detail in Sec. VII and the functional 
correctness of Lutsig in Sec. VIII. Since the behavior of the 
variable avg is specified using an always_comb block, no 
register should be generated for the variable; this is further 
discussed in Sec. IX in the context of the nonfunctional 
correctness property we have proved about Lutsig. 

FPGAs. At this point, our formal development ends. To run 
the netlist implementation produced by Lutsig on an FPGA, the 
netlist needs to be placed and routed onto an FPGA chip and 
then encoded into a bitstream for the chip. In our experiments, 
we used the unverified synthesis suite Vivado 2020.2 for these 
last steps. According to our manual testing, the netlist Lutsig 
synthesizes for the Verilog module in Fig. 1 runs correctly on 
the FPGA board we used for testing. 


V. FORMAL SEMANTICS 


In this section we first describe the updated source language 
of Lutsig (Sec. V-A); that is, we describe the subset of Verilog 
that Lutsig supports and Lutsig’s Verilog semantics $L_{ver}$ for 
this subset. We then describe the updated target language of 
Lutsig (Sec. V-B), that is, Lutsig’s netlist language. 


A. Lutsig’s Verilog semantics 


In Lutsig, circuits are represented as Verilog modules. A 
Verilog module, in turn, in Lutsig, consists of: 


• a set of input signals (including a clock signal clk), 
• a set of variables, some marked externally visible, 
• a set of always_comb blocks, and 
• a set of always_ff @(posedge clk) blocks. 


Lutsig’s Verilog semantics is a functional operational semantics 
that takes the following four inputs: 


• a Verilog module m to execute, 
• the number of clock cycles n to execute the module, 
• a function $f_{ext} : \mathbb{N} \to \mathit{string} \to \mathit{value}$ modeling snapshots 
of the nondeterministic world outside the module, and 
• a function $f_{bits} : \mathbb{N} \to \mathit{bool}$ modeling a stream of 
nondeterministic bits⁴. 


Since Lutsig’s Verilog must be convenient to use in formal 
reasoning, Lutsig’s Verilog is not, in contrast to full Verilog, 
based on nondeterministic event processing. Since Lutsig 
targets synchronous designs, the complexities of an event- 
driven semantics can be fully avoided. Of particular interest is 
the process-level semantics of Lutsig’s Verilog semantics, since 
the expression-level and statement-level semantics have not 
been updated for this new version of Lutsig. In short, Lutsig’s 
Verilog semantics for executing one clock cycle is: 


• For clock cycle zero, i.e. before the first clock tick, 
initialize all variables (for a variable without a specified 
initial value, assign a nondeterministic value) and then 
run all always_comb blocks in dependency order. 
• For all other clock cycles, run all always_ff blocks in 
declaration order followed by all always_comb blocks 
in dependency order. 
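
The two cases above can be sketched as a small interpreter. The Python code below is a simplified illustration under stated assumptions, not Lutsig's HOL4 semantics: the dictionary layout of a module and the function name run_module are hypothetical, every variable is assumed to have an initial value (so the nondeterministic bit stream is omitted), always_ff blocks are assumed to use only nonblocking assignments, the always_comb blocks are assumed to be given already sorted, and which input snapshot each block observes is a choice made for this sketch.

```python
def run_module(module, n_cycles, fext):
    """Execute a module for n_cycles clock ticks; return the final state.

    module is a dict with hypothetical keys:
      'init': dict of initial variable values,
      'ff':   list of functions (inputs, old_state) -> dict of updates,
      'comb': list of functions (inputs, state) -> dict of updates,
              assumed to be sorted in dependency order already.
    fext maps a cycle number to that cycle's input snapshot.
    """
    # Cycle zero: initialize variables, then run the always_comb blocks.
    state = dict(module['init'])
    for block in module['comb']:
        state.update(block(fext(0), state))
    # All other cycles: always_ff blocks first, then always_comb blocks.
    for cycle in range(1, n_cycles + 1):
        inputs = fext(cycle)
        old = dict(state)
        for block in module['ff']:            # declaration order
            state.update(block(inputs, old))  # nonblocking: read old values
        for block in module['comb']:          # dependency order
            state.update(block(inputs, state))
    return state


# Usage: an 8-bit counter with a combinational output that doubles it.
counter = {
    'init': {'count': 0, 'double': 0},
    'ff': [lambda inp, old: {'count': (old['count'] + inp['inc']) & 0xFF}],
    'comb': [lambda inp, st: {'double': (2 * st['count']) & 0xFF}],
}
final_state = run_module(counter, 3, lambda cycle: {'inc': 1})
```

Since the interpreter is an ordinary function of the module, the cycle count, and the input function, it involves no event queue, which reflects the point made next about avoiding event-driven semantics for synchronous designs.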


A module’s always_ff blocks are, in Lutsig’s Verilog, 
executed in declaration order since the order of execution does 
not affect the final result of execution as long as not more 
than one process writes to the same variable and all writes 
to variables that are read by processes other than the process 
making the writes are nonblocking (a type of assignment used 
for communication between processes in Verilog). 

A module’s always_comb blocks are, in Lutsig’s Verilog, 
executed in dependency order because the order of execution 
does matter: blocking writes are used even for variables 
shared between processes. All always_comb blocks are sorted 
before execution by their variable dependencies in the sense 
that no process writes to a variable that has been read 
by an earlier process. If the processes cannot be sorted in 
this way, the semantics aborts with an error. Sorting the 
processes complicates the semantics, since a sorting algorithm 
is embedded into the semantics. (We have, however, proved that 
the algorithm sorts correctly.) The sorting algorithm picks one 
particular permutation, but users of the semantics should think 
of it as an arbitrary permutation of the input always_comb 
blocks that satisfies the mentioned dependency-order criteria.⁵ 

⁴See Lööw [14] for a discussion on how X values are treated in Lutsig. 
We do not repeat the discussion on X values here since such concerns are 
orthogonal to our current concerns. 
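
The dependency-order criterion can be illustrated with a short sorting routine. The Python sketch below is not the sorting algorithm embedded in Lutsig's semantics; it is one hypothetical way to compute an order in which no block writes a variable read by an earlier block, and to fail when no such order exists. The representation of a block as a (name, reads, writes) triple is an assumption of this sketch.

```python
def sort_comb_blocks(blocks):
    """Order blocks so that no block writes a variable read by an earlier one.

    blocks is a list of (name, reads, writes) triples, where reads and
    writes are sets of variable names. Returns the block names in a valid
    order, or raises ValueError if the blocks cannot be sorted.
    """
    remaining = list(blocks)
    ordered = []
    while remaining:
        for candidate in remaining:
            _, reads, _ = candidate
            # Variables written by the blocks that would run later.
            later_writes = set()
            for other in remaining:
                if other is not candidate:
                    later_writes |= other[2]
            # The candidate may run next only if nothing that is still
            # to come writes a variable the candidate reads.
            if not (reads & later_writes):
                ordered.append(candidate)
                remaining.remove(candidate)
                break
        else:
            raise ValueError("always_comb blocks cannot be sorted")
    return [name for name, _, _ in ordered]
```

For instance, a block that reads sum is placed after the block that writes sum, while two blocks that each read what the other writes cause the routine to raise an error, mirroring the abort described above. A block that reads a variable it writes itself (as avg does in Fig. 1) is unaffected, since only other blocks are considered.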

Our intention is that Lutsig’s non-event-driven Verilog 
semantics should coincide with the event-driven simulation 
semantics of full Verilog, as defined by the Verilog standard, 
as long as good coding style is followed; e.g., as mentioned 
above, not writing blockingly in an always_ff block to a 
variable shared between processes. As part of future work, 
we plan to formally prove a correspondence between the two 
semantics to make the relationship between them more precise. 
Such future semantics work is important for Lutsig when 
arguing that Lutsig is a Verilog synthesis tool, but such work 
is simultaneously independent of Lutsig in the sense that it 
would not require Lutsig’s implementation and proofs to be 
updated, as long as the work does not unveil problems in the 
non-event-driven semantics (which would require us to revisit 
the semantics). 


B. Lutsig’s netlist semantics 


For this version of Lutsig, to support the compilation 
of always_comb blocks, we split netlist registers into two 
groups: pseudoregisters and real registers. Pseudoregisters are 
only needed to represent intermediate compilation results — 
i.e., pseudoregisters are always compiled away before the 
compilation process is finished. We explain how pseudoregisters 
are used in the compilation process in Sec. VII. After adding 
pseudoregisters, a netlist in Lutsig consists of two lists of cells 
and two lists of registers: one list of cells for the real registers 
and one list of cells for the pseudoregisters. 

There is a formal semantics in functional-operational style 
associated with Lutsig’s netlists. The semantics takes the same 
kind of arguments as Lutsig’s Verilog semantics except a 
netlist is given rather than a Verilog module. Netlist execution 
is similar to Lutsig’s Verilog execution. First, we define a 
netlist step to be running all pseudoregister cells, updating all 
pseudoregisters, and then running all remaining cells. Now, with 
this terminology in mind, we can describe the full semantics: 


• For clock cycle zero, initialize all registers and then do a 
netlist step. 
• For all other clock cycles, update all real registers and 
then do a netlist step. 


⁵Picking one particular permutation rather than an arbitrary permutation 
simplifies some proofs in the development. But since picking an arbitrary 
permutation would simplify the user-facing presentation of the semantics, it 
might be worth revisiting this choice. 


It is important that the netlist semantics is simple since the 
semantics is part of the trusted base of circuits produced with 
the help of Lutsig. In fact, for netlists without pseudoregisters, 
such as the final output netlists generated by Lutsig, it is easy 
to prove that the above semantics collapses into the following 
clean semantics $L_{nl}$: 


• For clock cycle zero, initialize all registers and then run 
all cells. 
• For all other clock cycles, update all registers and then 
run all cells. 
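
The collapsed semantics is simple enough to sketch in a few lines. The Python code below is an informal illustration under stated assumptions, not Lutsig's netlist semantics: the representation of a netlist and the name run_netlist are hypothetical, cells are assumed to be listed in evaluation order, and each register is assumed to take its next value from a named wire computed in the previous cycle.

```python
def run_netlist(netlist, n_cycles, fext):
    """Execute a pseudoregister-free netlist; return (registers, wires).

    netlist is a dict with hypothetical keys:
      'regs':  {register_name: (initial_value, source_wire_name)},
      'cells': list of (output_wire, function(inputs, regs, wires)),
               assumed to be listed in evaluation order.
    fext maps a cycle number to that cycle's input snapshot.
    """
    def run_cells(inputs, regs):
        wires = {}
        for out, fn in netlist['cells']:
            wires[out] = fn(inputs, regs, wires)
        return wires

    # Cycle zero: initialize all registers, then run all cells.
    regs = {name: init for name, (init, _) in netlist['regs'].items()}
    wires = run_cells(fext(0), regs)
    for cycle in range(1, n_cycles + 1):
        # Other cycles: update all registers, then run all cells.
        regs = {name: wires[src] for name, (_, src) in netlist['regs'].items()}
        wires = run_cells(fext(cycle), regs)
    return regs, wires


# Usage: a one-register counter whose next value is count + inc.
counter_netlist = {
    'regs': {'count': (0, 'next')},
    'cells': [('next', lambda inp, regs, wires: (regs['count'] + inp['inc']) & 0xFF)],
}
final_regs, final_wires = run_netlist(counter_netlist, 3, lambda cycle: {'inc': 1})
```

The whole interpreter is a single loop over two phases, which gives a feel for why such a semantics is a reasonable thing to include in the trusted base.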


VI. THE PROOF-PRODUCING VERILOG CODE GENERATOR 


For this paper, we have extended the proof-producing 
Verilog code generator bundled with Lutsig with support for 
translating always_comb blocks, such that we can prove 
circuits containing such blocks correct. 

The code generator can generate a deeply embedded Verilog 
circuit given a shallowly embedded Verilog circuit. To shallowly 
embed a Verilog circuit means to express it as a HOL function 
(i.e., a functional program). Shallowly embedded circuits are 
convenient to work with since HOL4 has well-developed 
infrastructure for reasoning about functional programs. The 
code generator is an SML function which is proof-producing in 
the sense that it, for every run, proves a HOL theorem (using 
the HOL4 API) ensuring that the input circuit and output circuit 
have the same behavior. 

Since the input language to the code generator is Verilog, 
although shallowly embedded, there is no need to provide a 
new set of hardware-modeling idioms (i.e., a new synthesis 
semantics) for the input language. In other words, the input 
circuits should be seen as Verilog circuits, and, when shallowly 
embedding Verilog circuits, according to the style the code 
generator expects, the hardware developer should think of 
themselves as doing Verilog development. 

The code generator assumes that circuits are embedded 
in the style we now describe. Verilog processes must be 
embedded as next-state functions over (module-specific) state 
records. For each process, the generated Verilog code closely 
mirrors the given input HOL function. E.g., recall that the 
always_ff block in the Verilog module in Fig. 1 is simply 
“h0 <= signal; h1 <= h0; h2 <= h1; h3 <= h2;”;
the next-state function the block is generated from is: 


avg_ff fext s s' ≝ let
    s' = s' with h0 := fext.signal;
    s' = s' with h1 := s.h0;
    s' = s' with h2 := s.h1 in
    s' with h3 := s.h2


Note how field updates are translated to assignments in Verilog 
in a straightforward manner (the syntax r with f := v means 
that field f of record r is updated to value v). Also note how two 
state records s and s’ are passed around; these two state records 
are the basis of the nonblocking-assignments embedding style
used. The record s contains the values of all variables at the
start of the current clock cycle, and the record s’ contains
the current values of all variables. To see why both records
are needed, consider e.g. the assignments to h0 and h1 in the
generated always_ff block: since the assignment to h0 is
nonblocking, the updated value of h0 is not available until the
next clock cycle, and the HOL embedding of the h1 assignment
must therefore read the value of h0 from the s record (not the
s’ record) to model Verilog’s semantics correctly.

⁶ Unrelatedly, we have also changed how nonblocking assignments are
shallowly embedded, such that a larger set of Verilog designs can be embedded.
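
The role of the two state records can be sketched in Python as follows. This is only an illustration under simplifying assumptions: the `State` dataclass stands in for the module-specific HOL record, the external input is passed directly as `signal` rather than as a field of fext, and `avg_ff_wrong` is a hypothetical variant added to show what goes wrong when reads are taken from the updated record.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class State:
    h0: int = 0
    h1: int = 0
    h2: int = 0
    h3: int = 0


def avg_ff(signal, s, s_new):
    # Nonblocking assignments: right-hand sides read the start-of-cycle
    # record s, while writes accumulate in s_new.
    s_new = replace(s_new, h0=signal)
    s_new = replace(s_new, h1=s.h0)
    s_new = replace(s_new, h2=s.h1)
    return replace(s_new, h3=s.h2)


def avg_ff_wrong(signal, s, s_new):
    # Reading from s_new instead models blocking assignments, so the new
    # value of h0 leaks into h1, h2 and h3 within the same cycle.
    s_new = replace(s_new, h0=signal)
    s_new = replace(s_new, h1=s_new.h0)
    s_new = replace(s_new, h2=s_new.h1)
    return replace(s_new, h3=s_new.h2)


start = State(h0=1, h1=2, h2=3, h3=4)
shifted = avg_ff(9, start, start)
smeared = avg_ff_wrong(9, start, start)
```

With `avg_ff`, the values shift by one position per cycle as a shift register should; with `avg_ff_wrong`, the new input overwrites every stage at once.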

The rest of the HOL circuit embedding style closely mirrors 
Lutsig’s Verilog semantics. First, there is a function 


procs [] fext s s' ≝ s'
procs (p::ps) fext s s' ≝ procs ps fext s (p fext s s')


for combining a list of next-state functions into one single 
next-state function. The function allows for building one next- 
state function for all always_ff blocks in the module and 
one next-state function for all always_comb blocks. One 
important caveat is that the always_comb blocks must be 
provided in dependency order, otherwise the HOL circuit 
will not correctly mirror Lutsig’s Verilog semantics since 
Lutsig’s Verilog semantics sorts all always_comb blocks by 
dependency before execution. The resulting two next-state 
functions formed by composing all always_ff blocks and 
always_comb blocks, respectively, using procs, can then be 
given to the following function, also mirroring Lutsig’s Verilog 
semantics, to build a full circuit: 


mk_circuit sstep cstep s fext 0 = cstep (fext 0) s s
mk_circuit sstep cstep s fext (Suc n) = let
    s = mk_circuit sstep cstep s fext n;
    s = sstep (fext n) s s in
    cstep (fext (Suc n)) s s


E.g., the HOL representation of the Verilog module in Fig. 1 
is mk_circuit (procs [avg_ff]) (procs [avg_comb]). 
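
To make the composition concrete, here is a Python transliteration of procs and mk_circuit. It is a sketch only: states are modeled as dictionaries, and the processes `ff` and `comb`, the input function `fext`, and the initial state `init` are hypothetical and unrelated to the module in Fig. 1.

```python
def procs(ps):
    """Combine next-state functions into one, threading s' left to right."""
    def step(fext, s, s_new):
        for p in ps:
            s_new = p(fext, s, s_new)
        return s_new
    return step


def mk_circuit(sstep, cstep, s, fext, n):
    """State after n clock cycles, following the two defining equations."""
    if n == 0:
        return cstep(fext(0), s, s)
    s = mk_circuit(sstep, cstep, s, fext, n - 1)
    s = sstep(fext(n - 1), s, s)
    return cstep(fext(n), s, s)


# Hypothetical toy design with one sequential and one combinational process.
def ff(fext, s, s_new):
    return {**s_new, "h0": fext["signal"]}      # h0 <= signal;


def comb(fext, s, s_new):
    return {**s_new, "dbl": 2 * s_new["h0"]}    # dbl = 2 * h0;


def fext(n):
    return {"signal": n + 1}


init = {"h0": 0, "dbl": 0}
final = mk_circuit(procs([ff]), procs([comb]), init, fext, 2)
```

Note that the combinational process reads from the updated record, since its assignments are blocking, whereas a sequential process with nonblocking assignments would read from the start-of-cycle record.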

Lastly, one more level of encoding is needed to handle 
variable initialization, which is simple and we do not detail here. 


VII. LUTSIG 


We now discuss Lutsig’s new support for always_comb 
blocks. To simultaneously honor both Verilog’s simulation semantics
and Verilog’s synthesis semantics — in this paper, specifically,
for the latter, the modeling idiom that always_comb
blocks must always be mapped to combinational logic — Lutsig 
must take on the responsibility to abort if the two semantics 
differ in what semantics they assign to some always_comb 
block in a given design. In this section, we discuss how 
Lutsig implements this responsibility. In Sec. VIII, we show 
that Lutsig successfully achieves its responsibility towards 
Verilog’s simulation semantics, by presenting a theorem stating 
that Lutsig is semantics preserving with respect to Lutsig’s 
formalization of Verilog’s simulation semantics. In Sec. IX, we 
show that Lutsig successfully achieves its responsibility towards 
Verilog’s synthesis semantics (for always_comb blocks), by 
presenting a theorem stating that always_comb blocks are
never mapped to registers (or other stateful constructs).


Concretely, the above responsibility boils down to ensuring
that there is no sequential logic inside any always_comb block.
This is where pseudoregisters come in: all variables written to
by an always_comb block are mapped to pseudoregisters,
and all other variables are mapped to real registers. All
pseudoregisters must then be compiled away before the
synthesis process is over, otherwise Lutsig aborts with an error.


A. Variable-level and element-level analysis 


To keep the implementation of Lutsig simple, the decision 
whether to map a variable to a pseudoregister or a real register 
is done on the level of variables. E.g., all elements of an 
array variable are either all mapped to pseudoregisters or to 
real registers. In full Verilog, the analysis is instead based on 
longest static prefixes [30, p. 282]. Such more fine-grained 
analysis allows for different parts of an array to be mapped to 
different kinds of logic, which could possibly be practically 
useful, but would clutter the solution presented here without 
providing additional insight. 

Note, however, that some amount of element-level analysis 
is still needed. E.g., consider a module containing only one 
variable a with type logic[1:0] and the following block: 

always_comb begin
    a[0] = inp0;
    a[1] = inp1;
end


The block represents combinational logic since all elements 
of the array are assigned. But if one of the assignments 
would have been left out, then the block would not represent 
combinational logic. Hence, an analysis on the element level 
cannot be fully avoided. 


B. Lutsig’s synthesis passes 


In Lutsig, pseudoregisters are removed at a late stage in the 
synthesis pipeline. The following pipeline passes in Lutsig are 
important for our discussion here: 


SYNT Synthesize the given Verilog design to a netlist 
REM Remove unused registers (variable-level analysis) 
DET Remove all nondeterminism from the netlist 
MAP Compile and technology-map away array cells 
REM Remove unused registers (element-level analysis) 


Pseudoregisters are introduced in SYNT and not removed until 
MAP. Since MAP is done on the element level (rather than 
the variable level as the passes before it), it was natural to 
place the removal of pseudoregisters there. The downside of 
this approach is that we had to update all intermediate passes 
of Lutsig, such as REM and DET, to handle the more complex 
netlist semantics with pseudoregisters. (Note that REM is run 
twice, which we motivate in the next section.) 


C. Problems in compiling combinational logic 


We now highlight how Lutsig handles some of the problems 
related to compiling combinational logic. Our presentation is 
example driven and many of the examples relate to detecting 
simulation-and-synthesis mismatches. It is important to consider 
not only designs that are rejected by Lutsig but also designs 
that are accepted, since compiler-correctness theorems like
Lutsig’s (of the form $Comp\ P_S = OK\ P_T \implies P_S \sim P_T$)
do not protect against compiler bugs that cause compilers to 
fail on valid input code (i.e., bugs causing the compiler to 
return Error when it should have returned OK). To exemplify, 
consider the extreme case of a compiler that always returns 
Error: such a compiler is vacuously correct, but, of course, 
not particularly useful. 

1) Combinational logic in always_ff blocks: Code inside 
always_comb blocks must always represent combinational 
logic only, but code inside always_ff blocks can represent 
both combinational and sequential logic. E.g., consider a 
module consisting of three variables a, b, and c with type 
logic[1:0] with one single block: 


always_ff @(posedge clk) begin
    a = inp0;
    b[0] = inp1;
    b[1] = inp2;
    c <= a + b;
end


Such code should not generate registers for a and b since those 
registers would never be read. REM makes sure the registers 
for a and b generated by SYNT are optimized away before 
the synthesis process is over. REM is run twice since we want 
to catch easy cases (such as a in the example) early but at the 
same time also make sure to catch cases requiring element-level 
analysis (such as b in the example). 

2) Sequential logic in always_comb blocks: Lutsig must
check that all always_comb blocks actually model combinational
logic. E.g., Lutsig must reject the following block:

always_comb a = a + 1;

For this paper, we have extended MAP to handle this. 

MAP handles the compilation of netlist-level array constructs 
such as array cells and array registers, by mapping them to 
array constructs natively available or to Boolean subcircuits. 
MAP is centered around a map σ from cell inputs to lists of
“marked” cell inputs. MAP visits all netlist cells in order and
the map σ is updated as the netlist is visited to keep track of
mapped cells. For real registers, all inputs are marked legal
from the start of compilation. For pseudoregisters, all inputs are
initially marked as illegal inputs. If an illegal input is referenced
during compilation (i.e., the (relevant part of the) σ entry for
the cell input is marked illegal), the compilation is aborted.

We now consider two examples. First, note that the reference 
to a on the right-hand side in the above always_comb block 
will cause the compilation to abort. Now, instead consider 
the following Verilog code exemplifying code Lutsig accepts 
(although note that the illustration is done on the Verilog level 
rather than on the netlist level that MAP is actually run at): 

always_comb begin
    // since b is a pseudoregister, we have:
    // sigma(b) = [illegal, illegal]
    b[0] = inp0; // sigma(b) = [illegal, inp0]
    b[1] = inp1; // sigma(b) = [inp1, inp0]
    // we can read the full b here since all
    // elements of b are legal
end
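
The legal/illegal bookkeeping can be mimicked with a few lines of Python. This is a Verilog-level sketch under assumptions of our own: the names `init_sigma`, `write`, `read`, and `accepts`, and the encoding of a block as a list of read and write operations, are hypothetical. The actual MAP pass operates on netlist cells, and the lists here are indexed from element 0, the reverse of the order used in the comments above.

```python
ILLEGAL = "illegal"


class SynthesisError(Exception):
    pass


def init_sigma(real_regs, pseudo_regs):
    """Both arguments map a variable name to its width in bits."""
    # Real registers start out legal: each element refers to itself.
    sigma = {v: [f"{v}[{i}]" for i in range(w)] for v, w in real_regs.items()}
    # Pseudoregisters start out illegal in every element.
    sigma.update({v: [ILLEGAL] * w for v, w in pseudo_regs.items()})
    return sigma


def write(sigma, var, index, source):
    sigma[var][index] = source


def read(sigma, var, index=None):
    entries = sigma[var] if index is None else [sigma[var][index]]
    if ILLEGAL in entries:
        raise SynthesisError(f"illegal read of {var}")
    return entries


def accepts(program, real_regs, pseudo_regs):
    sigma = init_sigma(real_regs, pseudo_regs)
    try:
        for op, *args in program:
            (write if op == "write" else read)(sigma, *args)
    except SynthesisError:
        return False
    return True


# b[0] = inp0; b[1] = inp1; followed by a read of the full b
full_assign = [("write", "b", 0, "inp0"), ("write", "b", 1, "inp1"), ("read", "b")]
# always_comb a = a + 1;
self_ref = [("read", "a"), ("write", "a", 0, "a + 1")]
# only b[0] is assigned before the full b is read
partial = [("write", "b", 0, "inp0"), ("read", "b")]
```

The same check also covers the partially assigned array discussed in Sec. VII-A: reading b while one of its elements is still illegal aborts the compilation.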


Note that since nonsynthesizable code is rejected by Lutsig, 
it is not important what semantics Lutsig’s Verilog semantics 
assigns to nonsynthesizable code. For some nonsynthesizable 
code, Lutsig’s semantics diverges from Verilog’s simulation 
semantics. E.g., recall that all blocks are unconditionally 
executed each clock cycle in Lutsig’s semantics. In contrast, 
in Verilog’s simulation semantics, always_comb blocks are 
only executed when something they depend on is updated. 
But since combinational logic is idempotent — that is, we 
can execute it multiple times without affecting the result — 
executing the same always_comb multiple times is harmless. 
However, if the always_comb block does not actually model 
combinational logic, this reasoning does not hold, and the two 
semantics might diverge. 
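
The idempotence argument can be checked on a small example. In this Python sketch, blocks are modeled as functions on a variable environment; the function names and the environment encoding are assumptions for illustration only.

```python
def comb_ok(env):
    # always_comb b = a + 1;
    return {**env, "b": env["a"] + 1}


def comb_bad(env):
    # always_comb a = a + 1;  (does not model combinational logic)
    return {**env, "a": env["a"] + 1}


def is_idempotent_on(block, env):
    # Executing the block a second time must not change the result.
    once = block(env)
    return block(once) == once


env = {"a": 3, "b": 0}
```

Re-executing `comb_ok` leaves the environment unchanged, so it does not matter how often a semantics chooses to run it. For `comb_bad`, each extra execution changes the result, which is why the two semantics may disagree on such code.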

3) Intrablock order problems: Recall the andor1b module 
with “mis-ordered” assignments discussed in Sec. II. The σ-based
MAP pass also handles such code correctly. E.g., Lutsig
rejects the following code with the same problem:

always_comb begin
    b = a + 1; // sigma(a) says a illegal here!
    a = inp;
end


4) Interblock order problems: Recall that Lutsig’s non-event- 
based Verilog semantics sorts always_comb blocks before 
execution (see Sec. V). E.g., to assign sensible semantics to 
the following code, the order of the blocks needs to be reversed 
before execution: 


always_comb b = a + 1;
always_comb a = inp;

The same order problem occurs in compilation: To compile the 
above code correctly, Lutsig must first sort the always_comb 
blocks by their dependencies. To sort, Lutsig uses the same 
sorting algorithm as used in Lutsig’s Verilog semantics. 

Not all processes can be ordered by their dependencies. Since 
combinational logic must not include combinational loops, the 
sorting algorithm used in Lutsig rejects code containing circular 
dependencies like the following: 

always_comb a = b + 1; 
always_comb b = a + 1; 
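
Dependency sorting with loop rejection can be sketched with a standard topological sort. The sorting algorithm Lutsig actually uses is not detailed in this section, so the Python sketch below substitutes the standard library's `graphlib.TopologicalSorter`; the block encoding and the name `sort_comb_blocks` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def sort_comb_blocks(blocks):
    """blocks maps a block name to (set of written vars, set of read vars)."""
    writer = {v: name for name, (writes, _reads) in blocks.items() for v in writes}
    # A block depends on every other block that writes a variable it reads.
    graph = {
        name: {writer[v] for v in reads if v in writer and writer[v] != name}
        for name, (_writes, reads) in blocks.items()
    }
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError("combinational loop between always_comb blocks") from exc


# always_comb b = a + 1;  always_comb a = inp;
reordered = {"blk_b": ({"b"}, {"a"}), "blk_a": ({"a"}, {"inp"})}
# always_comb a = b + 1;  always_comb b = a + 1;
loop = {"blk_a": ({"a"}, {"b"}), "blk_b": ({"b"}, {"a"})}
```

Self-dependencies within a single block, such as a = a + 1, are deliberately ignored here, since in Lutsig those are caught by MAP rather than by the sorting step.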


5) If statements: Lutsig handles if statements correctly. E.g.,
the following code is rejected:

always_comb
    if (c)
        a = inp;
    //else
    //    a = 'X;


If instead the else branch is uncommented, then Lutsig 
synthesizes the code successfully. The original block without 
an else branch gets stuck in the synthesis process since SYNT 
generates a mux with inp and the pseudoregister generated for 
a as inputs and MAP eventually detects that a pseudoregister 
is referenced and aborts the synthesis process. 

6) Case statements and nested if statements: Compiling case 
statements is similar to compiling if statements: if a variable 
is assigned in one branch, then it must be assigned in all other 
branches as well. Let the variable c have type logic[1:0]
and consider the following code:

always_comb
    case (c)
        // ... branches covering all four values of c, each assigning a
        //     (branch bodies not recoverable from the extracted text) ...
        //default: a = 'X;
    endcase

A sufficiently smart synthesis tool would realize that a is
assigned for all possible values of c. However, Lutsig’s synthesis
algorithm is not smart and requires the commented-out
default branch above to realize that all cases are covered. The
same holds for the analogous situation with nested if statements.
In fact, Lutsig handles case statements by expanding them to
nested if statements, so Lutsig’s limited case statement handling
is a consequence of Lutsig’s limited if statement handling.
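
Why the default branch matters after expansion can be seen in a small Python sketch. It uses a purely syntactic "assigned on every path" check as a stand-in: Lutsig itself detects the problem through muxes that reference pseudoregisters, and the statement encoding and the names `expand_case` and `assigned` are assumptions for this sketch.

```python
def expand_case(selector, branches, default=None):
    """Expand a case statement into nested ifs.

    branches is a list of (value, statements); statements are lists of
    ("assign", var) or ("if", cond, then_stmts, else_stmts) tuples.
    """
    stmt = default if default is not None else []
    for value, body in reversed(branches):
        stmt = [("if", f"{selector} == {value}", body, stmt)]
    return stmt


def assigned(stmts):
    """Variables that are assigned on every path through stmts."""
    result = set()
    for s in stmts:
        if s[0] == "assign":
            result.add(s[1])
        else:
            # An if only guarantees what both of its branches assign.
            result |= assigned(s[2]) & assigned(s[3])
    return result


branches = [(value, [("assign", "a")]) for value in range(4)]
without_default = expand_case("c", branches)
with_default = expand_case("c", branches, default=[("assign", "a")])
```

Without a default, the innermost if has an empty else branch, so a syntactic analysis sees a path on which a is not assigned, even though that path is unreachable for a two-bit selector.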


VIII. FUNCTIONAL CORRECTNESS OF LUTSIG 


We now state Lutsig’s functional-correctness theorem,
thereby showing that Lutsig successfully abides by (its formalization
of) Verilog’s simulation semantics. The theorem
statement is the same as in the previous version of Lutsig; the
HOL4 proof of the theorem, however, has been updated to take
into account the new functionality added in this paper. If we let
$P \Downarrow^{n,fbits}_{L} S$ denote that design P’s externally visible state is S
under the semantics L after n clock cycles with nondeterminism
source fbits, then Lutsig’s correctness theorem is as follows:


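$$
\begin{aligned}
&Lutsig\ P_{ver} = OK\ P_{nl} \implies \\
&\quad \exists S_{nl},\ P_{nl} \Downarrow^{n,fbits}_{L_{nl}} S_{nl} \;\land \\
&\quad \exists fbits',\ P_{ver} \Downarrow^{n,fbits'}_{L_{ver}} S_{ver} \implies S_{nl} = S_{ver}
\end{aligned}
$$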


Per the usual convention, all free variables in the theorem are
implicitly universally quantified. Note that since the netlist
$P_{nl}$ in the theorem statement never contains pseudoregisters,
we can use the simplified netlist semantics $L_{nl}$, which does
not handle pseudoregisters.

Although the theorem statement is more complex than
straightforward backward simulation as presented in Sec. II-A,
the theorem still allows for theorem transportation from the
Verilog level down to the netlist level by simple composition
(i.e., VPD): Given a circuit-correctness theorem stating that a
Verilog module $P_{ver}$ never crashes (regardless of what fbits is
supplied), say $\exists S_{ver},\ P_{ver} \Downarrow^{n,fbits}_{L_{ver}} S_{ver} \land Spec\ S_{ver}$ for some
specification Spec, if Lutsig successfully synthesizes $P_{ver}$ to a netlist
$P_{nl}$, then we can easily derive $\exists S_{nl},\ P_{nl} \Downarrow^{n,fbits}_{L_{nl}} S_{nl} \land Spec\ S_{nl}$.


IX. NONFUNCTIONAL CORRECTNESS OF LUTSIG 


We now turn to the nonfunctional correctness of Lutsig. 
Recall that Verilog’s synthesis semantics enables hardware 
designers to express hardware design ideas to their synthesis 
tool through modeling idioms. The theorem presented in 
this section, which we have proved in HOL4, shows that 
Lutsig correctly handles always_comb blocks in the sense that 
the theorem captures the modeling idiom that always_comb 
blocks must be mapped to combinational logic [30, p. 207]. 


We formalize this modeling idiom as follows: for any run
$Lutsig\ P_{ver} = OK\ P_{nl}$, if a variable is written to in an
always_comb block in $P_{ver}$, then no register with the same
name as the variable will be included in $P_{nl}$. Formally, the
theorem is as follows:

$$Lutsig\ P_{ver} = OK\ P_{nl} \implies \forall var,\ var \in comb\_vars\ P_{ver} \implies var \notin regs\ P_{nl}$$


Note that the theorem relates concepts in the input design $P_{ver}$
(writes) to concepts in the final netlist $P_{nl}$ (registers); this
means that we must, in our proofs, carry information from the
very first compilation phase down to the very last.⁷


X. CONCLUSION 


We now conclude. In our discussion on the relationships
between Verilog’s simulation semantics, Verilog’s synthesis
semantics, VPD, and TVD, we identify Verilog’s modeling
idioms as the core cause of tensions between VPD and TVD.
To put our discussion to the test, we have added support for
always_comb blocks to the verified synthesis tool Lutsig.

Our discussion on VPD and TVD paves the way for further 
Lutsig extensions that add support for Verilog constructs 
associated with simulation-and-synthesis mismatches, such as 
support for BRAM inference. 

Another interesting direction for future work to explore is 
how a more detailed hardware semantics would affect the 
always_comb discussion. In this paper our Verilog semantics 
is at the level of cycle-by-cycle behavior. What are the
alternatives for a more detailed hardware semantics that,
while keeping source-level reasoning
feasible, allow us to turn the nonfunctional property we have
proved in this paper into a part of the compiler’s functional
correctness theorem?

Lastly, no approach to hardware development, regardless 
of hardware language used, completely shields the hardware 
designer from the synthesis aspects we have discussed in this 
paper. It would therefore be interesting to consider how much 
of our discussion on VPD and TVD translates into hardware 
development and synthesis-tool verification for other hardware 
languages. The questions we raise in this paper will reappear in 
similar form regardless of the hardware language used. After all, 
not even so-called high-level synthesis (HLS), i.e., generating 
hardware from software languages like C, can completely 
hide the synthesis process from hardware developers. E.g., the 
manual [38, p. 17] for Vitis, an unverified HLS tool for C, C++, 
and OpenCL, states that “arbitrary, off-the-shelf software cannot 
be efficiently converted into hardware” and that, moreover, 
“even if [a] software program can be automatically converted 
(or synthesized) into hardware, achieving acceptable quality 
of results, will require additional work such as rewriting the
software to help the HLS tool achieve the desired performance
goals.” The pessimism of the manual [38, p. 28] continues:
“Software written for CPUs and software written for FPGAs
is fundamentally different. You cannot write code that is
portable between CPU and FPGA platforms without sacrificing
performance.” To prepare its readers for hardware development
using Vitis, the manual informs its readers what they need
to know about the Vitis synthesis process to design efficient
hardware; in other words, the HLS hardware designer, much
like the Verilog hardware designer, must be aware of how to
control their synthesis tool and how to communicate to their
synthesis tool what kind of hardware they want. In total, the
Vitis manual is 660 pages, reflecting the fact that not even
HLS manages to abstract away the complexities of synthesis.

⁷ Before we started working on the proof, Lutsig did not actually satisfy our
formalization of the always_comb modeling idiom. This was because the
SYNT pass (see Sec. VII) used the presence of writes in the design that was
given to that pass to decide which variables to map to real registers and which
to pseudoregisters rather than the presence of writes in the design as given by
the user (i.e., $P_{ver}$ in the above theorem); the former does not reliably track
the latter since writes may be optimized away in the compilation process!
Abstract—Symbolic circuit simulation has been the main
vehicle for formal verification of Intel Core processor execution 
engines for over twenty years. It extends traditional simulation by 
allowing symbolic variables in the stimulus, covering the circuit 
behavior for all possible values simultaneously. A distinguishing 
feature of symbolic simulation is that it gives the human verifier 
clear visibility into the progress of the computation during the 
verification of an individual operation, and fine-grained control 
over the simulation to focus only on the datapath for that 
operation while abstracting away the rest of the circuit behavior. 

In this paper we describe an automated simulation complexity 
reduction method called timed causal fanin analysis that can be 
used to carve out the minimal circuit logic needed for verification 
of an operation on a cycle-by-cycle basis. The method has been a 
key component of Intel’s large-scale execution engine verification 
efforts, enabling closed-box verification of most operations in the 
interface level. 

As a specific application, we discuss the formal verification of 
Intel’s new half-precision floating-point FP16 micro-instruction 
set. Thanks to the ability of the timed causal fanin analysis to 
separate the half-precision datapaths from full-width ones, we 
were able to verify all these instructions closed box, including 
the most complex ones like fused multiply-add and division. This 
led to early detection of several deep datapath bugs. 

Index Terms—Formal Verification, Symbolic Simulation,
Complexity Reduction


I. INTRODUCTION 


Comprehensive formal verification of execution engines 
has been standard practice in virtually all Intel® Core™ and 
Intel Atom® processor development projects in the last two 
decades, and extensive infrastructure has been built to support 
these efforts. Formal verification of Intel processor execution 
engines is primarily based on symbolic circuit simulation, 
a technology extending usual digital circuit simulation with 
symbolic values, representing sets of concrete values in a 
single simulation [1], [2], [3], [4], [5]. 

Full correctness of processor execution engines is indispensable
for product quality, as errata in basic execution datapaths
tend to be both customer visible and un-patchable. Due to the
size of the data space and the difficulty of identifying and
covering all internal corner cases with either pre-silicon or
post-silicon testing, formal verification is the only approach
that can ensure sufficient quality, especially for complex
floating point datapaths.

Other names and brands might be claimed as the property of others.

Execution engines in industrial processor designs typically 
combine a set of different pipelined datapaths into a single 
design component. To minimize circuit size, each individual 
datapath multiplexes logic for a family of related operations, 
controlled by operation-specific control signals. The datapaths 
may support different latencies, with simpler operations executing
in fewer pipestages than complex ones. Many datapaths
are implemented as straight pipelines; however, certain
operations may use iterative algorithms with feedback loops.
Designs also usually contain bypass networks that route data 
from the datapath outputs directly back to the inputs, avoiding 
the delay of going through a register file. The execution engine 
in a contemporary Intel processor has several million logic 
gates and hundreds of thousands of flip-flops, and the source 
code for it consists of hundreds of thousands of lines of code 
in a hardware description language. 

Focusing on the verification of an individual operation 
implemented in an execution engine, we can conceptually 
distinguish two different sources of verification complexity: 

1) the inherent complexity of the plain datapath for the operation,
   ignoring all other functionality of the execution engine, and
2) the complexity caused by the presence of the rest of
   the execution engine, and its possible effects on and
   interferences with the datapath of the operation.

As an example of the first, any datapath involving multiplication
can be expected to pose a verification challenge,
irrespective of any surrounding logic. For the second, the 
isolation of the result of an operation in a shared result bus 
depends on the control logic of all the datapaths sharing the 
bus. In a practical verification task, the verification engineer 
faces these two dimensions simultaneously, and the complexity 
caused by the surrounding logic may make the verification 
of even inherently trivial datapaths, such as bitwise OR, 
challenging or infeasible. 

Considering the inherent datapath complexity, without the surrounding
environment, the large majority of operations implemented
on an execution engine can be directly verified
by symbolic simulation in a closed-box fashion. This is the
ideal scenario due to the many advantages of closed-box 
verification: a well-defined specification, no need of insight 
into implementation details, and low sensitivity to internal 
design changes. For the most complex operations, especially 
complex floating-point arithmetic such as multipliers, fused 
multiply-adders and dividers, this straightforward approach is 
computationally infeasible, and verification is done by means 
of decomposed reference models, requiring time and both 
design and verification expertise. 

If the plain datapath for an individual operation were to 
be isolated from the surrounding logic, for most operations 
it would be amenable to verification by a variety of techniques
besides symbolic simulation. However, in practice, the
datapath is tightly enmeshed with the rest of the execution 
engine, and there is no straightforward way to isolate it. In 
this respect, symbolic simulation has a unique advantage over 
many competing verification approaches, such as formal equivalence
verification or traditional model checking: it allows the
verification engineer to understand the computational progress 
of an operation in the circuit in very concrete terms, to carve 
out a minimal amount of logic that needs to be simulated 
for the datapath of that specific operation, and to efficiently 
abstract away the rest. In other words, symbolic simulation 
provides an effective way to separate the two sources of verification
complexity. The main technical ingredients enabling
this ability are discussed in Section II. 

Nevertheless, as execution engines typically implement 
thousands of individual operations, and for each operation the 
datapath controls are wired differently, the cost of the human 
effort to analyze and isolate each datapath becomes a limiting 
factor. 

In this paper we describe an algorithmic technique 
called timed causal fanin analysis to derive a tight over- 
approximation of the circuit logic relevant for the simulation 
of the datapath of an individual operation (Section III). This 
method effectively automates the human process of determining
the minimal circuit logic for a specific datapath. It
is based on the use of information from an earlier, more 
abstract and less accurate symbolic simulation run to reduce 
the fanin cone of the logic of interest on a cycle-by-cycle basis. 
The method enables fully automated closed-box verification 
of most operations in an execution engine, not just for an 
isolated datapath, but in the context of the full design unit. It 
is meaningful only in the context of verification by symbolic 
simulation. The method has been a key technical enabler in 
Intel’s large-scale verification initiatives over the span of many 
years [3], [6]. However, the current paper is the first detailed 
exposition of the method in the public domain. 

For a recent example illustrating the effects of timed causal 
fanin analysis, in Section V we discuss the verification of 
the new FP16 floating-point instruction set on a recent Intel 
Core processor design. Since the Intel 8087 floating-point 
co-processor was introduced in 1980, Intel processors have 
supported single, double, and extended precision floating point 
formats. The formal verification of complex operations such
as multiplication, division, etc., on these formats has always
required decomposition, making such verification a time-consuming
expert task. Recent Intel Core processor designs
have added a new shorter half-precision floating-point format, 
also known as FP16 [7]. Because of the lower datapath width, 
the inherent verification complexity of FP16 datapaths is also 
lower, bringing them closer to the set of designs that one could 
hope to verify without decompositions. 

As a practical result, we found out that all FP16 micro- 
operations could be verified closed box, including the complex 
multiplication, fused multiply-add, division and square root 
operations. This led to fast verification convergence and early 
detection of several high complexity datapath bugs. The timed 
causal fanin analysis technique was particularly crucial for 
datapaths shared between FP16 and higher precision operations.
It allowed us to avoid simulating the higher-precision
logic, the complexity of which would have otherwise made 
verification impossible. 


II. SYMBOLIC CIRCUIT SIMULATION 


Symbolic simulation extends traditional digital circuit simulation
by allowing the input stimulus to contain symbolic
variables in addition to the concrete values 0, 1 or X [1]. These 
symbolic variables are effectively names of values, denoting 
sets of possible actual concrete values. In the simulation, these 
symbolic values propagate alongside the concrete values, and 
in each logic gate, they may be combined with each other 
or one of the concrete values to result in either a concrete 
value or a logical expression on the symbolic variables, 
represented by an expression graph. In this paper, as in most 
of symbolic circuit simulation verification practice, we use the 
binary decision diagram (BDD) representation for symbolic 
expressions [8]. See Figure 1 for an example. 

In a bit level symbolic simulator, a single symbolic variable 
a corresponds to the set of boolean values containing both 
O and 1. If stimulus to a symbolic simulation refers to the 
variables a, b and c, the internal signals might carry values like 
a^b or aV (bA 7c). Usual logic rules apply: if the inputs to 
an AND-gate are a and 1, the output will be a, if the inputs to 
an AND-gate are a and b, the output is the logical expression 
a/b, and if the input to a NOT-gate is b, the output will 
be ~b. In symbolic simulation, a specific symbolic variable 
is associated with a specific signal and time in the stimulus. 
This does not fix the value, but instead gives a name that can 
be used to refer to the value. 
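
As a minimal sketch of how symbolic values combine in gates, the following Python fragment represents each symbolic expression by its full truth table over the variables a, b and c. The truth table is only an illustrative stand-in for a BDD: like a BDD it is canonical, so equal functions compare equal, but unlike a BDD it is always exponential in size. The names VARS, var, const, AND, OR and NOT are assumptions made for this example and are not part of any tool described in the paper.

```python
from itertools import product

VARS = ("a", "b", "c")
ROWS = list(product((0, 1), repeat=len(VARS)))  # every assignment to VARS

def var(name):
    """Symbolic variable: its value in every row of the truth table."""
    i = VARS.index(name)
    return tuple(row[i] for row in ROWS)

def const(bit):
    """Concrete value 0 or 1, constant over all rows."""
    return tuple(bit for _ in ROWS)

def AND(f, g):
    return tuple(x & y for x, y in zip(f, g))

def OR(f, g):
    return tuple(x | y for x, y in zip(f, g))

def NOT(f):
    return tuple(1 - x for x in f)

a, b, c = var("a"), var("b"), var("c")

# AND-gate with inputs a and 1 yields a; canonicity makes this a plain equality.
same_as_a = AND(a, const(1)) == a
# A contradictory expression collapses to the concrete value 0.
collapses = AND(a, NOT(a)) == const(0)
```

Because the representation is canonical, a signal whose expression happens to simplify to a constant is recognized as a concrete 0 or 1 without any extra work.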

The special value X is used in symbolic simulation to denote 
a universal undefined or unknown value, which propagates 
according to rules such as in Figure 2. The value X denotes 
lack of information: we do not know whether the value is 0 
or 1. The propagation rules reflect this intuition. Symbolic 
simulation uses X’s as an abstraction mechanism: unlike 
symbolic variables, X’s are an over-approximation of Boolean 
circuit behavior. Both symbolic variables and X’s allow us to 
verify a property over a single symbolic trace, and conclude 
that it is valid over every possible trace instantiating the X’s 
and the symbolic variables with 0’s or 1’s. This ability of a
single symbolic trace to cover all behaviors of a circuit allows
us to use symbolic simulation as a formal verification method.

Fig. 1. Symbolic expressions in simulation

Fig. 2. Logic with the undefined value X
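
The propagation of X can be sketched in a few lines of Python. The rules used here are the standard three-valued (Kleene) rules, which we assume match the intent of Figure 2; the function names are hypothetical. The helper is_sound checks the over-approximation property stated above: whenever the three-valued gate returns a concrete value, every 0/1 instantiation of the X inputs gives that same value.

```python
X = "X"  # the undefined value

def t_not(a):
    return X if a == X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:
        return 0              # a controlling 0 decides the output
    if a == 1 and b == 1:
        return 1
    return X                  # otherwise the result is unknown

def t_or(a, b):
    if a == 1 or b == 1:
        return 1              # a controlling 1 decides the output
    if a == 0 and b == 0:
        return 0
    return X

def instances(v):
    """Concrete values that a three-valued value may stand for."""
    return (0, 1) if v == X else (v,)

def is_sound(op3, op2):
    """True if every concrete result of op3 agrees with op2 on all instances."""
    for a in (0, 1, X):
        for b in (0, 1, X):
            r = op3(a, b)
            if r != X and any(op2(x, y) != r
                              for x in instances(a) for y in instances(b)):
                return False
    return True
```

For example, t_and(0, X) is 0 because the other input cannot matter, while t_and(1, X) stays X.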

Figure 3 depicts a simplified pipelined ALU circuit with 
a 16-bit wide two-cycle datapath from inputs to outputs, and 
Figure 4 depicts a typical symbolic trace that might be used 
in the verification of this ALU, focusing on a single instance 
of an eight-bit wide bitwise OR operation. In the stimulus, the 
control signals are driven with concrete values corresponding 
to the operation, and the input data is driven with symbolic 
variables a[15],...,a[0] and b[15],...,b[0] in the one cycle in 
which the operation is issued. In all other cycles these signals 
have the undefined value X (gray waveform). In the simulation, 
the values of the output data and zero flag two cycles later are 
then expressions on the symbolic variables associated with the
input data, and in all other cycles they are X’s.

Verification by symbolic simulation resembles bounded
model checking (BMC), but with
two important differences. First, BMC considers instances of
a property in a time window up to a given bound, whereas 
symbolic simulation focuses on one fixed instance of a prop- 
erty, and second, BMC starts from a properly initialized state 
of a system, and symbolic simulation from an unconstrained 
state. The focus on one fixed instance of a property can be 
seen as a distinguishing aspect of symbolic simulation. 

The size of the symbolic expressions flowing in the signals 
of the circuit during the simulation is the most crucial com- 
plexity metric and the limiting factor determining what can 
and cannot be computed. We strive to minimize this symbolic 
complexity in several ways: 


1) by choosing the properties to be verified so that they 
are as narrowly targeted as possible and by restricting 
the circuit simulation to only those scenarios that are 
relevant for the property under verification, 

2) by limiting the number of symbolic variables and con- 
crete 0/1 values used in the simulation stimulus to 
maximize the use of the default undefined value X, 

3) by limiting the set of signals for which simulation values 
are computed, the times for which those values are 
computed, and the values that are computed, and 

4) by choosing concise representations for the computed 
symbolic expressions. 


For example, in execution engine verification we (1) focus on 
one operation instance at a time, (2) drive symbolic values 
on inputs only when the operation instance under verification 
samples them, (3) simulate only signals that are needed for 
the datapath of the operation and only at times relevant to 
the progression of its pipeline, and (4) use a BDD variable 
ordering that is a good match for the operation. 

Symbolic simulation works best with targeted properties of 
fixed length pipelines, typically of the transactional form 


trigger A at time t is followed by response B at time t + n


To restrict circuit behaviors to cover only cases where the 
trigger of the property under verification is true, we use the 
technique of parametric substitutions [9], [10]. The basic setup 
for the parametric substitution algorithm is that we want to 
verify an implication C(y) ⇒ D(y) between two symbolic
expressions C and D over a set of symbolic variables y, and the
assumption C in some fashion makes it easier to compute the
goal D. The algorithm creates a mapping y ↦ p from variables
y to symbolic expressions p such that when the symbolic
variables in p range over all possible values, the values of the
symbolic vector p range exactly over the set of assignments
to y for which the condition C is true. Then, the implication
can be verified by checking whether D(p) holds.
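
The defining property of a parametric substitution can be illustrated with a brute-force Python sketch. It is not the BDD-based algorithm of [9], [10]: it computes the substitution pointwise by enumeration, which is feasible only for a handful of variables. The function names and the one-hot example condition are assumptions chosen for illustration. For each parameter value v_i in turn, the sketch keeps v_i if the condition is still satisfiable with that choice and flips it otherwise.

```python
from itertools import product

def satisfiable(cond, prefix, n):
    """Can the partial assignment `prefix` be extended to satisfy cond?"""
    rest = n - len(prefix)
    return any(cond(prefix + tail) for tail in product((0, 1), repeat=rest))

def parametrize(cond, n):
    """Return p mapping free parameters v to an assignment y with cond(y)."""
    if not satisfiable(cond, (), n):
        raise ValueError("condition is unsatisfiable")
    def p(v):
        y = ()
        for vi in v:
            # Follow the parameter when allowed, otherwise take the forced value.
            choice = vi if satisfiable(cond, y + (vi,), n) else 1 - vi
            y += (choice,)
        return y
    return p

def image_of(p, n):
    """All assignments reachable as the parameters range over all values."""
    return {p(v) for v in product((0, 1), repeat=n)}

# Hypothetical condition C: exactly one of three bits is set.
def one_hot(y):
    return sum(y) == 1

p = parametrize(one_hot, 3)
```

Here image_of(p, 3) is exactly the set of one-hot assignments, so checking a goal D on p(v) for all v is the same as checking C(y) ⇒ D(y) for all y.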

In the context of symbolic simulation, the aim is to check 
an implication between the trigger and the goal of the property 
being verified over the traces of the circuit. This is done by 
computing a parametric substitution from the trigger, carrying 
out the symbolic simulation with the parametrized expressions
p instead of the original variables y in the stimulus, and by
checking that the verification goal is true in the resulting trace. 
For a concrete example of parametric substitution for symbolic 
simulation triggers, please see Section III below, especially 
Figure 6 and the related discussion. 

The techniques for limiting the sets of signals, times or 
values for which simulation is done are collectively called 
weakening. In weakening the user instructs the simulator to 
replace a value that would otherwise be computed with the 
undefined value X. We distinguish three kinds of weakening: 


• Universal weakening, where the user instructs the simu-
lator to replace the values of certain signals with X’s at
all times in the simulation. It is equivalent to the concepts 
of ‘free’ or ‘stop-at’ present in many model checkers. 

• Cycle specific weakening, where the user instructs the
simulator to replace the values of certain signals with 
X’s, but only at specified times. This technique is unique 
to symbolic simulation, and the fact that it is even 
meaningful to talk about signals at specific times in 
the verification task is directly related to the fact that 
symbolic simulation focuses on just one fixed instance 
of the verification goal. Cycle specific weakening is an 
extremely versatile technique that allows users to apply 
their intuition about the usage of signals at times relative 
to the progress of the operation under verification in order 
to reduce the simulation cost. 

• Dynamic weakening, where the user instructs the simula-
tor to replace any symbolic value with X, if the size of
the expression for the value would exceed a user-given 
threshold. Dynamic weakening is a robust technique that 
allows users to quickly resolve many symbolic com- 
plexity issues caused by the computation of unnecessary 
expressions in the simulation without detailed analysis. 

Weakening is a safe complexity reduction technique: if we 
verify a property over a symbolic simulation trace with weak- 
ening, the same property also holds over a trace with the same 
stimulus and no weakening. 
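
The safety of weakening can be sketched on a toy example. The Python fragment below simulates a small hypothetical combinational netlist with three-valued logic for a single cycle, optionally forcing a set of signals to X. The netlist, the signal names and the helper names are assumptions made for this illustration. The function refines checks that every concrete value in the weakened trace also appears in the unweakened one, which is the reason a property verified with weakening carries over.

```python
X = "X"

def t_and(a, b):
    if a == 0 or b == 0:
        return 0
    return 1 if (a, b) == (1, 1) else X

def t_not(a):
    return X if a == X else 1 - a

# Hypothetical netlist in topological order: signal -> (gate, inputs).
NETLIST = {
    "n1": ("not", ("sel",)),
    "n2": ("and", ("sel", "d0")),
    "n3": ("and", ("n1", "d1")),
    "out": ("and", ("n2", "en")),
}

def simulate(stimulus, weaken=frozenset()):
    """One-cycle three-valued simulation; signals in `weaken` are forced to X."""
    val = {s: (X if s in weaken else v) for s, v in stimulus.items()}
    for sig, (gate, ins) in NETLIST.items():
        args = [val.get(i, X) for i in ins]   # undriven inputs default to X
        res = t_not(*args) if gate == "not" else t_and(*args)
        val[sig] = X if sig in weaken else res
    return val

def refines(weak, full):
    """Every concrete value of the weakened trace agrees with the full trace."""
    return all(v == X or full[s] == v for s, v in weak.items())

stim = {"sel": 1, "d0": 1, "d1": 0, "en": 0}
full = simulate(stim)
weak = simulate(stim, weaken={"n2", "n3"})
```

In this run the output is 0 in both traces: en is 0, so the weakened internal signals n2 and n3 were never needed.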

The computations in symbolic simulation are conceptually 
simple and concrete. Further, they can be naturally related 
to the progress of the operation under verification through 
its pipeline. This gives the verification engineer fine-grained 
visibility into the computations on the level of individual sig- 
nals, enabling precise analysis and mitigation of computational 
complexity bottlenecks through weakening. In the context of 
execution engine verification, this visibility allows the verifier 
to identify the datapath of an individual operation and weaken 
the surrounding circuit logic. However, when pipelines for 
different operations are tightly enmeshed in a circuit, it is often 
time-consuming to determine which signals and simulation 
times are really needed for a specific operation. 


III. TIMED CAUSAL FANIN ANALYSIS


As discussed above, the size of the symbolic expressions is 
the primary capacity barrier in a simulation, and consequently 
it is very important that we avoid the computation of symbolic
values unnecessarily, in contexts where they do not contribute
meaningfully to the verification goal. In a forward simulator 
this is not trivial. When simulating a certain cycle, we do 
not know yet which signals in that cycle will matter to the 
verification goal in a later cycle. 

One straightforward technique for reducing the set of signals 
for which simulation needs to compute values is the standard 
cone of influence (COI) reduction. The validity of a verifica- 
tion goal can only depend on the transitive fanin of signals 
referenced in it, and therefore signals outside of this set do 
not need to be simulated. However, for execution engines that 
contain bypass networks, the circuit forms in practice a nearly 
strongly connected graph, i.e. almost every signal is in the 
transitive fanin of almost every other signal, and the cone of 
influence reduction offers little help. 
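
A plain cone of influence computation is a transitive closure over the fanin relation, as in the following Python sketch. The netlist is a hypothetical fragment with a bypass path, invented for illustration. It shows how a single feedback edge pulls the multiplier into the cone of an adder output.

```python
def cone_of_influence(fanin, goal_signals):
    """Transitive fanin of the goal signals; fanin maps signal -> drivers."""
    cone, stack = set(), list(goal_signals)
    while stack:
        s = stack.pop()
        if s in cone:
            continue
        cone.add(s)
        stack.extend(fanin.get(s, ()))
    return cone

# Hypothetical execution unit: results feed back to sources through a bypass.
fanin = {
    "add_out": ["src_a", "src_b"],
    "mul_out": ["src_a", "src_b"],
    "src_a": ["reg_a", "bypass"],
    "src_b": ["reg_b"],
    "bypass": ["add_out", "mul_out"],
    "unrelated": ["reg_c"],
}

cone = cone_of_influence(fanin, ["add_out"])
```

Although the goal refers only to add_out, the cone contains mul_out because of the bypass edge; only the fully disconnected signal is excluded.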

Another source of reduction comes from the simplifying 
effect of any global constants in the design. For example, an 
AND-gate with one input a constant zero does not actually 
depend on the value of its other input, and that other input 
can be removed from the fanin of the gate without changing 
the behavior of the circuit. As designers do not intentionally 
include dead logic in their designs, such global constants 
usually reflect circuit functionality, such as test or scan modes, 
that can be completely disabled for verification purposes. They 
usually offer only marginal help in reducing simulation scope 
around the main functionality of a design. 

The timed causal fanin analysis algorithm is based on the 
idea of using constants to reduce the fanin cone of interest. 
However, this is done on a cycle-specific basis, relative to 
the cycle times in a fixed symbolic simulation, using the 
concrete 0/1 values present in that cycle only. As with cycle- 
specific weakening, the fact that we can meaningfully refer to 
a particular cycle relative to a verification task is specific to 
symbolic simulation. The three main steps of the method are: 

1) Perform a preliminary symbolic simulation to determine 
cycle-specific concrete 0/1 values in the simulation. 

2) Compute the transitive cone of influence of nodes and 
cycles in the verification goal per cycle, using the 
concrete 0/1 values from step 1 to reduce the fanins 
in each cycle. 

3) Compute a cycle-specific weakening list, per cycle, that 
weakens every signal of the circuit except the signals 
in the transitive cone of influence for that cycle, as 
computed in step 2. 

Step 1 consists of a symbolic simulation run for the circuit 
with the same stimulus that is used for the main verification 
run. However, for this initial simulation, the dynamic weaken- 
ing threshold is low. As described in Section II, this means that 
any symbolic expressions above the threshold are discarded 
and replaced with X’s in the simulation. The size threshold 
is specified by the user. All relevant cycles of the resulting 
simulation trace are then scoured for all concrete 0/1 values.

It is important to note that this preliminary simulation is 
much more than just timed constant propagation. First, the 
trigger of the property has already been factored into the 
stimulus with parametric substitution, and any concrete 0/1 
values implied by the trigger are present in the trace, especially
in the pipelined datapath control signals. Second, in addition 
to the concrete values that the trigger forces directly in the 
stimulus, also the concrete values that are implied indirectly 
by circuit logic together with the trigger restrictions are present 
in the trace, due to the canonicity of the BDD representations 
and the automatic simplification in BDD operations. 

Step 2 consists of a backwards traversal over relevant 
simulation cycles, starting from the last cycle of interest and 
proceeding back in time. For each cycle, we compute the 
causal fanin at that cycle using the concrete 0/1 values present 
at the cycle to reduce the causal fanin cone. 

For a combinational gate s of the circuit, we define the 
combinational causal fanin set of s at simulation time t to be 
the set of signals s_in such that s_in is an immediate fanin of s
and either

• s_in has a concrete 0/1 value in cycle t in the simulation
in Step 1, or
• the value of s_in may affect the value of s, given all the
concrete 0/1 values in the fanins of s in cycle t in the
simulation in Step 1.
In short, for each cycle the concrete 0/1 values computed in 
Step 1 for that cycle are used to reduce the fanin cone of 
combinational gates. For example, if selectors to a mux have 
concrete 0/1 values in a certain cycle, only the single mux 
input that is selected by those selectors is in the timed causal 
fanin in that cycle. 
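
The combinational causal fanin set can be sketched in Python as follows. We read "may affect" as: there is some 0/1 completion of the remaining unknown inputs under which flipping s_in changes the gate output. This reading, the enumeration-based check and all names are assumptions of the sketch; a production implementation would use the gate type rather than enumeration.

```python
from itertools import product

X = "X"

def causal_fanin(gate_fn, inputs, values):
    """inputs: fanin names of a gate; values: name -> 0, 1 or X in one cycle."""
    unknown = [s for s in inputs if values[s] == X]
    result = {s for s in inputs if values[s] != X}     # first rule: concrete
    for s in unknown:                                  # second rule: may affect
        others = [u for u in unknown if u != s]
        for bits in product((0, 1), repeat=len(others)):
            env = {**values, **dict(zip(others, bits))}
            if gate_fn({**env, s: 0}) != gate_fn({**env, s: 1}):
                result.add(s)
                break
    return result

def mux(v):
    """Hypothetical 2:1 mux: selects d1 when sel is 1, else d0."""
    return v["d1"] if v["sel"] else v["d0"]

def and2(v):
    return v["a"] & v["b"]

selected = causal_fanin(mux, ["sel", "d0", "d1"], {"sel": 1, "d0": X, "d1": X})
```

With the selector at a concrete 1, selected contains only sel and d1; the unselected input d0 drops out of the fanin for that cycle.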
For a flip-flop (state element) s_ff of the circuit, with input
s_in and clock c, we define the flip-flop causal fanin set of s_ff
at simulation time t by the rules:

• If the clock c toggles in cycle t in the simulation in Step
1, then s_in belongs to the set.
• If the clock c does not toggle in cycle t in the simulation
in Step 1, then s_ff belongs to the set.
• If the clock c is X in cycle t in the simulation in Step 1,
then both s_in and s_ff belong to the set.
Conceptually, if we do not know whether the clock toggles or 
not, both the input and the held value of the flip-flop matter. 
For each cycle t, we then define the timed causal fanin 
set cfan(t) as the minimal set of circuit signals satisfying the 
following rules: 


1) If the verification goal directly refers to signal s in cycle
t in the simulation, then s ∈ cfan(t).
2) If signal s is in the flip-flop causal fanin set of a flip-flop
s_ff at simulation time (t + 1), and s_ff ∈ cfan(t + 1), then
s ∈ cfan(t).
3) If signal s is in the combinational causal fanin set of a
combinational gate s_out at time t, and s_out ∈ cfan(t), then
s ∈ cfan(t).
For each cycle t, we compute cfan(t) by starting from the set of 
signals determined by the rules (1) and (2) and by constructing 
the transitive closure of the set under rule (3), stopping at the 
flip-flop boundary. 
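
The backward traversal can be sketched in Python under some simplifying assumptions: the combinational causal fanin sets are supplied as a precomputed lookup function, clock behavior per cycle is given as True (toggles), False (does not toggle) or X, and cycles are numbered from 0. The data layout, the function names and the tiny example circuit are hypothetical and serve only to make the three rules concrete.

```python
X = "X"

def ff_causal_fanin(ff, d_in, toggles):
    """Flip-flop rule: input if clocked, held value if not, both if unknown."""
    if toggles is True:
        return {d_in}
    if toggles is False:
        return {ff}
    return {d_in, ff}

def timed_causal_fanin(goal, flops, comb_fanin, clock_toggles, last_cycle):
    """goal: set of (signal, cycle); flops: ff -> (input, clock)."""
    cfan = {}
    for t in range(last_cycle, -1, -1):
        seeds = {s for (s, tt) in goal if tt == t}            # rule 1
        for ff in cfan.get(t + 1, set()):                     # rule 2
            if ff in flops:
                d_in, clk = flops[ff]
                seeds |= ff_causal_fanin(ff, d_in,
                                         clock_toggles.get((clk, t + 1), X))
        cone, stack = set(), list(seeds)                      # rule 3
        while stack:
            s = stack.pop()
            if s in cone:
                continue
            cone.add(s)
            if s not in flops:            # stop at the flip-flop boundary
                stack.extend(comb_fanin(s, t))
        cfan[t] = cone
    return cfan

# Hypothetical circuit: res is a flop fed by a mux between adder and multiplier.
flops = {"res": ("res_d", "clk")}
table = {
    ("res_d", 1): {"mul_sel", "add_out"},   # mul_sel is a concrete 0 in cycle 1
    ("add_out", 1): {"a", "b"},
}
cfan = timed_causal_fanin(
    goal={("res", 2)}, flops=flops,
    comb_fanin=lambda s, t: table.get((s, t), set()),
    clock_toggles={("clk", 2): True}, last_cycle=2)
```

In this example cfan[1] contains the adder path but not the multiplier output, because the mux selector was concrete in cycle 1.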
Step 3 finally constructs a weakening list that for every 
cycle t replaces the value of every signal not in cfan(t) with X.
This weakening list is then used in a full symbolic simulation
for the original verification goal. As the computation of the 
timed causal fanin in Step 2 includes all signals and times 
that may affect the signal-time references in the property 
under verification, the weakening list never abstracts with 
X any values that could contribute to the property. As an 
optimization, we can alternatively weaken only the barrier of 
signals whose fanin intersects with cfan(t) but which are not 
in cfan(t) themselves. 

As a point of comparison, consider the same verification 
task posed as a bounded model checking problem. If we look 
at just the timed constant propagation aspect of the preliminary 
simulation, and the concrete 0/1 values directly forced by 
the trigger, an analogous constant propagation process would 
take place at an early point inside the SAT call for the BMC 
problem, resulting in expression simplification similar to the 
fanin reduction above. As for the concrete 0/1 values indirectly 
implied by the trigger and the circuit logic, sooner or later they 
either might or might not be noticed and propagated by the 
SAT engine, depending on how hard the engine tries to de- 
termine constants. However, this whole process is completely 
hidden from the user, inside a SAT engine. In particular, if 
a potentially helpful simplification does not happen, either 
because the engine misses it or because the trigger does not 
capture the user intent accurately, the issue manifests to the 
user only through increased run time or the inability of the tool 
to resolve the verification goal, without actionable feedback 
that would enable the user to assist the tool. 

However, when we use timed causal fanin analysis in 
the symbolic simulation flow, the results of the preliminary 
simulation and the concrete 0/1 values that are or are not 
present are visible and accessible to the user. The values can be 
queried, viewed as waveforms and root-caused through circuit 
gates. The user can understand what happens in the simulation 
and compare that to their intuition and expectations about what 
should happen. The concept of the timed causal fanin cone 
itself is based on a clear operational intuition, allowing the user 
to understand the computation in terms of circuit functionality. 
A commonly asked debug question is: “why is signal s in 
cycle t in the timed causal fanin cone of my property, when
conceptually it should not matter, for example because it is 
in a different unit/datapath/pipestage?” This question can be 
concretely answered by showing a path of dependencies from 
the given signal and time through fanin relations to some signal 
and time relevant to the property being verified. 
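
Answering such a debug question amounts to a path search in the timed dependency graph. The following Python sketch assumes the causal fanin relation is available as a dictionary from (signal, cycle) nodes to their timed causal fanins; this layout, the function name and the example are hypothetical. A breadth-first search from the goal nodes returns a shortest chain of dependencies.

```python
from collections import deque

def explain(deps, goal_nodes, query):
    """Shortest dependency path from `query` to a goal node, or None."""
    parent = {g: None for g in goal_nodes}
    queue = deque(goal_nodes)
    while queue:
        node = queue.popleft()
        if node == query:
            path = []
            while node is not None:       # walk back towards the goal
                path.append(node)
                node = parent[node]
            return path                   # query first, goal node last
        for src in deps.get(node, ()):
            if src not in parent:
                parent[src] = node
                queue.append(src)
    return None

# Hypothetical timed dependencies: node -> its causal fanins.
deps = {
    ("res", 2): [("res_d", 1)],
    ("res_d", 1): [("mul_sel", 1), ("add_out", 1)],
    ("add_out", 1): [("a", 1), ("b", 1)],
}
path = explain(deps, [("res", 2)], ("a", 1))
```

A result of None means the queried signal and cycle are outside the timed causal fanin cone and will be weakened.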

As an example, consider the simplified ALU circuit in 
Figure 5 with a one-cycle adder unit and a two-cycle multiplier 
unit. At the interface, the signal vld marks a valid operation
and mul chooses between addition and multiplication. Further, 
suppose that we are focusing on adder correctness as expressed 
in the following property, where N and P are the next-time and 
previous-time temporal operators, respectively: 


(vld ∧ ¬mul)   ∧   P ¬(vld ∧ mul)   ⇒   N (is_ok(res))
    ADD               NOT MUL            RESULT OK
   time t           time (t − 1)        time (t + 1)

Fig. 5. Simplified ALU with adder and multiplier

Fig. 6. Stimulus and preliminary simulation trace

Fig. 8. Timed causal fanin cone computation


Conceptually this property says that if an addition operation is 
issued, and there is no pipeline hazard from a multiplication 
operation a cycle ago, then the circuit will produce functionally 
correct output in the next cycle (where we have omitted the 
details of ‘functionally correct output’ and its dependency on 
the data input signals). 

Figure 6 depicts a stimulus and trace for the Step 1 prelimi- 
nary simulation on the circuit, with an instance of the property 
above with time t = 1, starting in cycle 1 and producing output
in cycle 2. The stimulus values for the control signals vld 
and mul in cycles 1 and 0 have been generated by parametric 
substitution from the triggers of the property: 

• In cycle 1, the stimulus associates the concrete value 1
with the signal vld and the concrete value 0 with the
signal mul, since this is the only possible assignment
satisfying the trigger ‘ADD in cycle 1’, i.e. vld ∧ ¬mul.

• In cycle 0, the stimulus associates a symbolic variable v
with the signal vld and the symbolic expression ¬v ∧ m
with the signal mul, reflecting the trigger ‘NOT MUL
in cycle 0’. Note that the possible values of these two
symbolic expressions range exactly over the set of assign-
ments to vld and mul that make the trigger ¬(vld ∧ mul)
true in cycle 0, a guarantee of parametric substitution.
Note also that no concrete 0/1 assignment would capture
the trigger fully, since there are three possible concrete
value pairs satisfying the trigger.

Simplification on internal control signals, as depicted in Figure 
7, then leads to the trace of Figure 6. Using the cycle-specific 
concrete 0/1 values from this trace, Step 2 of the timed causal 
fanin analysis method proceeds as in Figure 8. In Step 3, all 
signals and times outside the timed causal fanin of Figure 8 
are weakened in the main simulation. Note, in particular, that 
all multiplier datapath logic is automatically weakened by the 
timed causal fanin algorithm. 
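
The claim about the cycle 0 stimulus can be checked by enumeration. The short Python sketch below evaluates the parametrized stimulus for vld and mul over all values of the parameters v and m and compares the result with the set of assignments satisfying the ‘NOT MUL’ trigger. Function and variable names are chosen for this illustration.

```python
from itertools import product

def trigger_not_mul(vld, mul):
    """The cycle 0 trigger: no valid multiplication is issued."""
    return not (vld and mul)

def stimulus(v, m):
    """Parametrized stimulus: vld = v, mul = (not v) and m."""
    return (v, int((not v) and m))

reachable = {stimulus(v, m) for v, m in product((0, 1), repeat=2)}
allowed = {(vld, mul) for vld, mul in product((0, 1), repeat=2)
           if trigger_not_mul(vld, mul)}
```

Both sets contain the same three pairs, (0, 0), (0, 1) and (1, 0), so the stimulus covers the trigger exactly.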

From the perspective of the user applying the timed causal 
fanin method, the practical workflow can be divided into two 
stages. First, there is the computation of the causal fanin cone 
in Steps 1 and 2. In this stage the user may need to adjust 
a default dynamic weakening threshold for the preliminary 
simulation in Step 1 or the default depth of the fanin cone 
traversal in Step 2 to balance two needs. On the one hand, 
the threshold and the depth of the fanin cone need to be low 
enough that the steps can be computed in a reasonable time. 
On the other, the threshold and depth need to be high enough 
that as many concrete internal values as possible are computed 
to reduce the causal fanin cone. In this first stage of the work 
the user also may find out that the verification triggers are not 
strong enough to guarantee the satisfaction of the verification 
goals, by simply looking at the causal fanin cone and noticing 
unexpected causal dependencies. These may either reflect a 
design bug, or a need to strengthen the triggers to properly 
capture the intent of the property under verification. 

In the second stage of the work the user then applies the 
weakening list computed in Step 3 in the main simulation, 
debugs any failures, and repeats the main simulation if neces- 
sary. In many instances the main simulation is less resource 
intensive than the preliminary one, since although the symbolic
expressions that need to be computed are larger, the number of 
signals for which they actually need to be computed is much 
lower, thanks to the timed causal fanin weakening list. 

The timed causal fanin algorithm is helpful in most sym- 
bolic simulation verification tasks, and we use it as a routine 
step in our verification flow. Already on its own, symbolic 
simulation is at its strongest for narrowly targeted properties, 
and the timed causal fanin method accentuates this strength. 
When comparing the automated weakening provided by the 
method to manually crafted weakening lists, in our experi- 
ence the automatically produced weakening is almost always 
superior, as user time and patience for fine-grained analysis of 
the design is often limited. As a weak point, the presence of 
data-qualified clocks in a design tends to reduce the efficacy 
of the method, as then the timed causal fanin cone will include
the same combinational logic over multiple cycles.

Two major building blocks underlying the timed causal 
fanin method are fundamentally BDD-based: first the para- 
metric substitution algorithm, and secondly the automated 
simplification of symbolic expressions in the internal wires 
of the circuit, which results in the concrete 0/1 values that are 
used to contain the fanin cone. If we want to avoid BDD’s and 
simulate with non-canonical expressions and use SAT instead, 
the same crucial process of identifying simplifying internal 
concrete 0/1 values could be achieved by speculative SAT 
queries checking for constants in the preliminary simulation 
under the trigger assumptions. The sheer number of internal 
signals in many circuits is a challenge in this approach, though. 
What works better in practice is a hybrid approach, where 
the preliminary simulation uses BDD’s, with the resulting 
automated simplification, but the main simulation used for 
the verification of the goals is carried out with non-canonical
expressions and SAT. 


IV. EXECUTION ENGINE FORMAL VERIFICATION 


At a high level, a single Intel Core consists of a set of
major design components called clusters. The front-end cluster 
fetches and decodes architectural instructions and translates 
them to micro-operations (abbreviated as uops), which the out- 
of-order cluster then schedules for execution. The execution 
engine, residing in the EXE cluster, carries out data compu- 
tations for all micro-operations. The memory cluster handles 
memory accesses and may contain first level caches. Outside 
of an individual core is a system-on-chip layer including, for 
example, a graphics processing unit and a memory controller. 

The execution engine for a typical Intel Core processor 
design implements over 5000 distinct uops in several different 
units: the integer execution unit (IEU) contains logic for plain 
integer and miscellaneous other operations, the single instruc- 
tion multiple data (SIMD) integer unit (SIU) contains logic 
for packed integer operations, the floating-point unit (FPU) 
implements plain and packed floating-point operations such as 
FADD, FMUL, FDIV, etc., the address generation unit (AGU)
performs address calculations and access checks for memory
accesses, the jump execution unit (JEU) implements jump
operations and determines and signals branch mispredictions,
and the memory interface unit (MIU) receives load data from
and passes store data to the memory cluster.

Formal verification of execution datapaths, especially for 
floating-point and other arithmetic operations has been a focus 
area at Intel ever since the Pentium® FDIV bug in 1994. The 
primary vehicle for this work is symbolic simulation, incor- 
porated in Intel’s in-house Forte/reFLect verification toolset 
under the name of Symbolic Trajectory Evaluation (STE) [2]. 
All Intel Core processor execution engine data-paths since 
2005, as well as most Intel Atom processor and Gen Graphics 
arithmetic engines have been formally verified using symbolic 
simulation [3], [6]. 

In formal verification, every uop corresponds to a separate 
symbolic simulation task. In the verification setup for a single 
uop the control signals are set to fix the data-path controls to 
match a single instance of that uop, and symbolic variables 
on the data are used to exhaustively simulate the data-path 
instance. The simulation is connected to an abstract functional 
reference model for the uop through source and write-back 
mappings, and the outputs of the design and the reference
model are compared. These design-dependent mappings extract
the intended source and result values for the uop at the relevant 
times relative to the instance we are verifying. 

Formal verification of complex designs would ideally be 
done by closed-box verification for its many advantages: a 
well-defined specification, no need of insight into implemen- 
tation details, and low sensitivity to internal design changes. 
For a large majority of uops in the execution engine, the data- 
path can be exhaustively symbolically simulated in one pass 
at the full cluster level. 

However, for complex floating-point arithmetic, such as 
multipliers, fused multiply-adders and dividers, the compu- 
tation of symbolic expressions for the datapaths is fundamen- 
tally technically infeasible. Instead, the verification of these 
complex uops is done through a decomposed reference model 
that splits an operation into several sequential stages, where
each stage of the reference model is separately related to a 
stage of the implementation. With such decomposition cut- 
points, we reduce symbolic simulation complexity, as each 
stage on its own produces smaller symbolic expressions than 
a full input-to-output closed-box simulation. For years, this 
has been the technique used for all the floating-point types 
traditionally implemented on Intel designs, i.e., single, double, 
and extended precision floats. 

Decomposed verification is technically much harder than 
closed-box verification, requiring both special verification ex- 
pertise and detailed insight into implementation details to map 
the decomposition stage boundaries to the design. It is also 
much more sensitive to even small design changes, making 
the maintenance cost high. Generally, the more stages the 
decomposition has, the harder the verification task is. The 
hardest datapath verification tasks on current Intel processor 
designs are the dividers, which need a series of decomposition 
stages and advanced complexity management strategies in 
each individual stage. 


V. HALF-PRECISION FLOATING-POINT ARITHMETIC


Floating-point numbers are a binary representation for a 
subset of real numbers as triples (s,e,m), where the sign 
s is a single bit, and the exponent e and mantissa m are 
unsigned bit vectors of some fixed lengths. The IEEE standards 
on floating-point numbers define several different formats 
differing on details, as well as special encodings for zeros, 
infinities, denormal numbers (very small numbers that are 
below the main range of values representable in a format), 
and other exceptional values [11]. Since only a subset of 
the reals is representable as floating-point numbers, not all 
results of arithmetic operations on floating-point numbers can 
be expressed precisely as floating-point numbers themselves. 
Therefore, the IEEE standards define the concept of rounding, 
determining which sufficiently close representable number 
should be used, if the accurate result is not representable. 

Intel designs have traditionally supported three formats of 
floating-point numbers: single, double, and extended precision. 
Recently, as a part of the AVX-512 extension set in the latest 
Intel Core processor designs, support was added for a new 
shorter floating-point format, the so-called half-precision or 
FP16, consisting of one sign bit, five exponent bits and ten 
mantissa bits [7]. While the new format offers a narrower range 
and less precision, it allows twice as many values to be packed 
into a vector than with single-precision floats, doubling the 
effective performance of vectorized algorithms for applications 
that do not need higher precision arithmetic. 
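
The FP16 layout can be made concrete with a small decoder. The Python sketch below assumes the IEEE 754 binary16 encoding (exponent bias 15, all-ones exponent for infinities and NaNs, zero exponent for zeros and denormals); the function name is hypothetical. Python's struct module supports the same format through the 'e' format code, which gives an independent reference for checking the decoder.

```python
import struct

def decode_fp16(bits):
    """Decode a 16-bit pattern: 1 sign bit, 5 exponent bits, 10 mantissa bits."""
    s = (bits >> 15) & 1
    e = (bits >> 10) & 0x1F
    m = bits & 0x3FF
    sign = -1.0 if s else 1.0
    if e == 0x1F:                          # infinities and NaNs
        return sign * float("inf") if m == 0 else float("nan")
    if e == 0:                             # zeros and denormals
        return sign * m * 2.0 ** -24       # (m / 1024) * 2**-14
    return sign * (1 + m / 1024) * 2.0 ** (e - 15)

def reference_fp16(bits):
    """Same decoding through the standard library, for comparison."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

largest = decode_fp16(0x7BFF)              # largest finite half-precision value
smallest = decode_fp16(0x0001)             # smallest positive denormal
```

The whole input space of a single FP16 operand has only 65536 patterns, which gives some intuition for why symbolic expressions over half-precision inputs stay smaller than for the wider formats.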

The architectural and micro-architectural instruction sets of 
the latest Intel Core processor designs support most com- 
mon arithmetic half-precision operations natively. Some half- 
precision uops are implemented in dedicated design units, 
some others in units shared with higher precision arithmetic. 
Half-precision division and square root uops are implemented 
by an iterative design shared with the similar higher precision 
uops. In contrast to some higher precision operations, denor- 
mal input and output values are handled natively for half- 
precision arithmetic, without microcode assistance. 

As the basic datapath for a half-precision uop has only 
half as many input data bits as the corresponding single-
precision uop, we know that the size of symbolic expressions
in its simulation is always lower than for single precision. 
Without experimentation we do not know how much lower, 
as the symbolic expression sizes can be at best linear and 
at worst exponential in the number of input bits, depending 
on the operation. What we do know is that any verification 
recipes that work for single precision should easily work 
for half precision. Also, we can realistically hope that the 
reduction in size might be large enough to obviate the 
need for decomposition for some of the complex operations, 
pushing them to the domain of closed-box verification, or 
at least reduce the decomposition needed. On the negative 
side, experience shows that native denormal handling tends to 
materially increase symbolic complexity, as denormals break 
the separation of exponent and mantissa datapaths. Also, we 
know that special care will be needed for uops implemented
in units shared between half precision and higher precisions
to avoid the prohibitive cost of simulating also the higher 
precision behavior. 


From this starting point, we carried out verification of 
all half-precision arithmetic uops on an Intel Core processor 
design. The technical learnings from the initiative can be 
summarized as follows: 


• Simple floating-point uops such as comparisons, conver-
sions to and from integers, reciprocals, etc., that allow
closed-box verification for higher precisions, were easily 
verifiable for half precision. As anticipated, floating- 
point addition (FADD) could also be directly verified, 
in contrast with higher precisions, where FADD needs an 
exponent difference-based case split. Timed causal fanin 
analysis was essential in the separation of the simple uop 
and FADD datapaths from the complex ones implemented 
in the same design units. 

• As the first result for known high complexity uops, we
were able to verify floating-point multiplication (FMUL) 
directly without a decomposition. This is in marked 
contrast with higher precisions where decomposition is 
unavoidable, as the symbolic expression sizes for mul- 
tiplication are known to be exponential. However, the 
lower number of mantissa bits for half precision means 
that we are not too far up the exponential curve yet 
in the basic datapath for the operation. For FMUL, the 
datapath is shared with the more complex fused multiply- 
add (FMA) operation. Timed causal fanin analysis helps 
FMUL verification by removing FMA-specific parts of 
the shared datapath, in particular in the rounding logic 
where FMUL exhibits only a narrow range of possible 
behaviors compared to FMA. 

• Somewhat surprisingly, we were also able to verify half-precision
fused multiply-add (FMA) uops without decomposition.
This required careful complexity management,
and a large case split on addend mantissa values to reduce 
the symbolic complexity of the basic datapath, with a high 
total run time. As FMA is the most complex operation 
on its shared datapath, there is no circuit logic that timed 
causal fanin analysis could just directly cut out. However, 
for each case in the case split, the simulation of the basic 
datapath alone approaches the capacity limits of the tool. 
How timed causal fanin analysis helps is by removing 
logic that is on the basic datapath, but is not relevant to 
the specific case. 

• Finally, with heavy use of simplifying case splits and
timed causal fanin analysis, we were able to carry out 
closed-box verification for half-precision division (FDIV) 
and square root (FSQRT) operations, as well. For divi- 
sion and square root, timed causal fanin analysis was 
indispensable, as the datapaths are mixed with the higher 
precision ones, and the long-latency uops have ample 
potential for uncontrolled symbolic expression growth. 


The most complex arithmetic datapath proofs showed that for
FP16, verification of all uops can be done closed box. In most
of these tasks, and in all high complexity ones, the contribution of
timed causal fanin analysis cannot be quantified by comparing
computation time or memory usage with and without the method, since
without either automated or manual weakening the closed-box
verification tasks are computationally infeasible. In our view,
the best metric is the human effort required for the verification.


The largest positive impact was observed on the operations 
that are traditionally the most complicated and heavy to 
verify. For FMUL, the first higher-complexity operation, we 
implemented a new verification strategy that did not include 
the decomposition that the higher-precision proof requires. 
Note that FMUL is in fact FMA without an addend, which
makes it a lighter task for verification; however, any bug we
catch on FMUL also exists on FMA. We continued
with a new verification strategy for the FMA operations: 
closed-box input-to-output verification with a case-split on 
addend mantissa value. The effort of FMA verification bring- 
up was reduced from several quarters for a higher-precision 
‘big-FMA’ in a standard Intel Core processor development 
project, to a couple of weeks. 


For FDIV and FSQRT the effort reduction was also sub- 
stantial. The proof was dramatically simplified, compared to 
the traditional multi-stage decomposed higher-precision proof. 
The FDIV and FSQRT proofs were completed in 6-8 weeks 
and provided confidence in design quality and arithmetic 
correctness. As with FMA, the effort for these verification tasks
is usually measured in quarters of work.

Comparing then automated vs manual generation of weak- 
ening lists, the simple uop and FADD verification likely could 
have been carried out with manual analysis, as these tasks are 
not computationally challenging and a coarse analysis would 
suffice. On the other hand, a manual separation of the FMUL 
logic from the FMA, or the logic used vs not used by the 
different FMA cases, and especially the separation of the 
FP16 FDIV and FSQRT datapaths from the higher precision 
ones would likely have required an extraordinary human effort 
focusing on design minutiae. 

The main advantages of the closed-box verification that 
enabled quick results were clear specification, ease of failure 
reproduction in dynamic validation with concrete source val- 
ues, and the absence of any need to locate cut-points and define 
complicated side conditions. The first corner-case datapath bug 
was found in less than a week of work. Altogether, the FP16 
verification initiative caught several extreme complexity bugs 
in just a few weeks of work at an early stage of the design
project. This reduced the design cost of fixing the issues, and 
most importantly prevented them from escaping to the silicon 
implementation. Here are two examples: 


1) An FMA16 uop multiplies two small positive normal 
numbers, produces a very small intermediate value, 
and adds the addend — the smallest normal negative. 
The mathematically accurate result is tiny, between the 
smallest normal negative and zero. Since Flush-To-Zero 
(FTZ) mode was set, the result ought to be zero, but the 
design returned the smallest normal negative. 


2) FMA received three very specific normal numbers as 
inputs, and FTZ was set. We expected to produce the 
smallest normal number after rounding to nearest, but 
the result was flushed to zero. The specific inputs were: 
a: s = 0 ; e = 00010 ; m = 1.0110000000
b: s = 0 ; e = 01111 ; m = 1.0001011011
c: s = 1 ; e = 00010 ; m = 1.1111111101
The intermediate result of the operation after it was
normalized was: s = 1 ; e = 0 ; m = 1.11111111111, with
one extra bit after the mantissa length, which is exactly
at half-point for rounding, and therefore needs to round 
up. After rounding and normalizing we got a normal 
(non-tiny) number: s = 1 ; e=1 ; m = 1.0000000000, 
that should not have been flushed to zero. 
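
The arithmetic of the second example can be replayed with exact rational numbers. The following Python sketch is only an illustration and not part of the verification flow described here: the helper names `fp16_normal` and `round_half_even` are hypothetical, and it assumes that tininess is detected after rounding with an unbounded exponent, as the description of the expected result implies.

```python
from fractions import Fraction

BIAS = 15        # binary16 exponent bias
FRAC_BITS = 10   # stored fraction bits of binary16

def fp16_normal(sign, exp_bits, frac_bits):
    """Exact value of a normal binary16 number given by its bit fields."""
    significand = 1 + Fraction(int(frac_bits, 2), 2 ** FRAC_BITS)
    value = significand * Fraction(2) ** (int(exp_bits, 2) - BIAS)
    return -value if sign else value

def round_half_even(x):
    """Round a non-negative Fraction to an integer, ties to even."""
    floor = x.numerator // x.denominator
    rem = x - floor
    if rem > Fraction(1, 2) or (rem == Fraction(1, 2) and floor % 2 == 1):
        return floor + 1
    return floor

a = fp16_normal(0, "00010", "0110000000")
b = fp16_normal(0, "01111", "0001011011")
c = fp16_normal(1, "00010", "1111111101")

exact = a * b + c                              # infinitely precise a*b + c
smallest_normal = Fraction(2) ** (1 - BIAS)    # 2**-14

# Normalize |exact| with an unbounded exponent: 1 <= |exact| / 2**e < 2.
e = 0
while abs(exact) / Fraction(2) ** e < 1:
    e -= 1
while abs(exact) / Fraction(2) ** e >= 2:
    e += 1

# Round the significand to FRAC_BITS fraction bits (round to nearest even).
ulp = Fraction(2) ** (e - FRAC_BITS)
sign = -1 if exact < 0 else 1
rounded = sign * round_half_even(abs(exact) / ulp) * ulp

print("biased exponent before rounding:", e + BIAS)
print("tiny before rounding:", abs(exact) < smallest_normal)
print("normal after rounding:", abs(rounded) >= smallest_normal)
```

Under these assumptions the exact result lies exactly halfway between two representable significands, and rounding carries it up to the smallest normal magnitude, which is why flushing it to zero was a bug.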


VI. SUMMARY 


Empirical experience has consistently shown that the timed 
causal fanin reduction algorithm is a key complexity reduction 
technique for practical symbolic simulation. It has also proven 
to be robust in face of design changes and over different design 
styles. 

Timed causal fanin analysis was the primary enabler allow- 
ing us to verify all FP16 uops, including the most complex 
arithmetic operations, without decompositions. Closed-box 
verification greatly reduced the development effort of complex 
proofs, leading to fast detection of deep corner-case bugs in 
early stages of the project. Avoiding the use of decomposition 
has lowered the sensitivity to design implementation and made 
the verification collateral easily reusable for future projects. 
Abstract: Recent methods based on Symbolic Computer Algebra
(SCA) have shown great success in formal verification
of multipliers and, more recently, of dividers as well. In
this paper we enhance known approaches by the computation 
of satisfiability don’t cares for so-called Extended Atomic Blocks 
(EABs) and by Delayed Don’t Care Optimization (DDCO) for 
optimizing polynomials during backward rewriting. Using those 
novel methods we are able to extend the applicability of SCA- 
based methods to further divider architectures which could not 
be handled by previous approaches. We successfully apply the 
approach to the fully automatic formal verification of large 
dividers (with bit widths up to 512). 


I. INTRODUCTION 


Arithmetic circuits are important components in processor 
designs as well as in special-purpose hardware for compu- 
tationally intensive applications like signal processing and 
cryptography. At the latest since the famous Pentium bug [1] 
in 1994, where a subtle design error in the divider had not 
been detected by Intel’s design validation (leading to erroneous 
Pentium chips brought to the market), it has been widely rec- 
ognized that incomplete simulation-based approaches are not 
sufficient for verification and formal methods should be used 
to verify the correctness of arithmetic circuits. Nowadays the 
design of circuits containing arithmetic is not only confined to 
the major processor vendors, but is also done by many different 
suppliers of special-purpose embedded hardware who cannot
afford to employ large teams of specialized verification engineers
able to provide human-assisted theorem proofs.
Therefore the interest in fully automatic formal verification of
arithmetic circuits is growing steadily.

In particular the verification of multiplier and divider cir- 
cuits formed a major problem for a long time. Both BDD- 
based methods [2], [3] and SAT-based methods [4], [5] for 
multiplier and divider verification do not scale to large bit 
widths. Nevertheless, there has been great progress during 
the last few years for the automatic formal verification of 
gate-level multipliers. Methods based on Symbolic Computer 
Algebra (SCA) were able to verify large, structurally com- 
plex, and highly optimized multipliers. In this context, finite 
field multipliers [6], integer multipliers [7]-[19], and modular 
multipliers [20] have been considered. Here the verification 
task has been reduced to an ideal membership test for the 
specification polynomial based on so-called backward rewriting,
proceeding from the outputs of the circuit in direction
of the inputs. For integer multipliers, SCA-based methods are 
closely related to verification methods based on word-level 
decision diagrams like *BMDs [21]-[23], since polynomials 
can be seen as “flattened” *BMDs [24]. Moreover, rewriting-based
approaches [25], [26] have recently been shown to be able to
verify complex multipliers as well as arithmetic modules with 
embedded multipliers at the register transfer level. 

Research approaches for divider verification were lagging 
behind for a long time. Attempts to use Decision Diagrams for 
proving the correctness of an SRT divider [27] were confined 
to a single stage of the divider (at the gate level) [28]. Methods 
based on word-level model checking [29] looked into SRT 
division as well, but considered only a special abstract and 
clean sequential (i.e., non-combinatorial) divider without gate- 
level optimizations. Other approaches like [30], [31], or [32] 
looked into fixed division algorithms and used semi-automatic 
theorem proving with ACL2, Analytica, or Forte to prove 
their correctness. Nevertheless, all those efforts did not lead 
to a fully automated verification method suitable for gate-level 
dividers. 

A side remark in [23] (where actually multiplier verification 
with *BMDs was considered) seemed to provide an idea for 
a fully automated method to verify integer dividers as well. 
Hamaguchi et al. start with a *BMD representing $Q \times D + R$
(where Q is the quotient, D the divisor, and R the remainder
of the division) and use a backward construction to replace
the bits of Q and R step by step by *BMDs representing
the gates of the divider. The goal is to finally obtain a
*BMD representation for the dividend $R^{(0)}$ which proves the
correctness of the divider circuit. Unfortunately, the approach 
has not been successful in practice: Experimental results 
showed exponential blow-ups of *BMDs during the backward 
construction. 

Recently, there have been several approaches to fully auto- 
matic divider verification that had the goal to catch up with 
successful approaches to multiplier verification: Among those 
approaches, [33] is mainly confined to division by constants 
and cannot handle general dividers due to a memory explosion 
problem. [34] works at the gate level, but assumes that 
hierarchy information in a restoring divider is present. Using 
this hierarchy information it decomposes the proof obligation 
$R^{(0)} = Q \times D + R$ into separate proof obligations for each
level of the restoring divider. Nevertheless, the approach scales
only to medium-sized bit widths (up to 21 as shown in the
experimental results of [34]). 

The approaches of [24], [35] work on the gate level as 
well, but they do not need any hierarchy information which 
may have been lost during logic optimization. They prove the 
correctness of non-restoring dividers by “backward rewriting” 
starting with the “specification polynomial” $Q \times D + R - R^{(0)}$
(similar to [23], with polynomials instead of *BMDs as inter- 
nal data structure). Backward rewriting performs substitutions 
of gate output variables with the gates’ specification polynomi- 
als in reverse topological order. They try to prove dividers to be 
correct by finally obtaining the 0-polynomial. The main insight 
of [24], [35] is the following: The backward rewriting method 
definitely needs “forward information propagation” to be suc- 
cessful, otherwise it provably fails due to exponential sizes 
of intermediate polynomials. Forward information propagation 
relies on the fact that the divider needs to work only within 
a range of allowed divider inputs (leading to input constraints 
like $0 \le R^{(0)} < D \cdot 2^{n-1}$). [24] uses SAT-based information
propagation (SBIF) of the input constraint in order to derive 
information on equivalent and antivalent signals, whereas [35] 
uses BDDs to compute satisfiability don’t cares which result 
from the structure of the divider circuit as well as from the 
input constraint. (Satisfiability don’t cares [36] at the inputs 
of a subcircuit describe value combinations which cannot be 
produced at those inputs by allowed assignments to primary 
inputs.) The don’t cares are used to minimize the sizes of 
polynomials. In that way, exponential blowups in polynomial 
sizes which would occur without don’t care optimization could 
be effectively avoided. Since polynomials are only changed for 
input values which do not occur in the circuit if only inputs 
from the allowed range are applied, the verification with don’t 
care optimization remains correct. In [35] the computation of 
optimized polynomials is reduced to suitable Integer Linear 
Programming (ILP) problems. 
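
To make the notion of satisfiability don’t cares concrete, the following Python sketch enumerates them by brute force for a tiny example. This is only an illustration: [35] uses BDDs rather than explicit enumeration, and the function name `satisfiability_dont_cares` and the half-adder example are assumptions made here.

```python
from itertools import product

def satisfiability_dont_cares(signal_fns, num_inputs, allowed=lambda bits: True):
    """Value combinations of the given signals that no allowed input produces.

    signal_fns: functions mapping a tuple of primary input bits to one bit each;
    allowed:    predicate modelling the input constraint.
    """
    reachable = set()
    for bits in product((0, 1), repeat=num_inputs):
        if allowed(bits):
            reachable.add(tuple(fn(bits) for fn in signal_fns))
    all_combinations = set(product((0, 1), repeat=len(signal_fns)))
    return sorted(all_combinations - reachable)

# Sum and carry of a half adder feed some subcircuit.
def sum_bit(bits):
    return bits[0] ^ bits[1]

def carry_bit(bits):
    return bits[0] & bits[1]

# Don't cares that result from the circuit structure alone.
structural = satisfiability_dont_cares([sum_bit, carry_bit], 2)

# Additional don't cares once an input constraint (first input is 0) is imposed.
constrained = satisfiability_dont_cares(
    [sum_bit, carry_bit], 2, allowed=lambda bits: bits[0] == 0)

print(structural, constrained)
```

Sum and carry of a half adder can never both be 1, so that combination is a don’t care of any block fed by them; the input constraint adds further don’t cares, which mirrors the two sources named above.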

In this paper we make two contributions to improve [24] and 
[35]: First, we modify the computation of don’t cares leading 
to increased degrees of flexibility for the optimization of 
polynomials. Instead of computing don’t cares at the inputs of 
“atomic blocks” like full adders, half adders etc., which were 
detected in the gate level netlist, we combine atomic blocks 
and surrounding gates into larger fanout-free cones, leading 
to so-called Extended Atomic Blocks (EABs), prior to the 
don’t care computation. Second, we replace local don’t care 
optimization by Delayed Don’t Care Optimization (DDCO). 
Whereas local don’t care optimization immediately optimizes 
polynomials wrt. a don’t care cube as soon as the polynomial 
contains the input variables of the cube, DDCO only adds 
don’t care terms to the polynomial, but delays the optimization 
until a later time. This method has two advantages: First, by 
looking at the polynomial later on, we can decide whether 
exploitation of certain don’t cares is needed at all, and 
secondly, the later (delayed) optimization will take the effect 
of following substitutions into account and thus uses a more 
global view for optimization. Using those novel methods we 
are able to extend the applicability of SCA-based methods
from [24], [35] to further optimized non-restoring dividers
and restoring dividers which could not be handled by previous
approaches.

Fig. 1. Circuit with series of substitutions. [Figure: full adder; backward rewriting from the output signature $2c_0 + s_0$ to the input signature $a_0 + b_0 + c$.]

The paper is structured as follows: In Sect. II we provide 
background on SCA and divider circuits. We motivate the need 
for novel optimizations by analyzing the existing approaches 
in Sect. III, and in Sect. IV we present the novel approach. 
The approach is evaluated in Sect. V and we conclude with 
final remarks in Sect. VI. 


II. PRELIMINARIES 
A. SCA for Verification 


For the presentation of SCA we basically follow [24]. 
SCA based approaches work with polynomials and reduce the 
verification task to an ideal membership test using a Gröbner 
basis representation of the ideal. The ideal membership test 
is performed using polynomial division. While Gröbner basis 
theory is very general and, e.g., can be applied to finite field 
multipliers [6] and truncated multipliers [17] as well, for 
integer arithmetic it boils down to substitutions of variables for 
gate outputs by polynomials over the gate inputs (in reverse 
topological order), if we choose an appropriate “term order” 
(see [11] or [14], e.g.). Here we restrict ourselves to exactly 
this view. 

For integer arithmetic we consider polynomials over binary
variables (from a set $X = \{x_1, \ldots, x_n\}$) with integer coefficients,
i.e., a polynomial is a sum of terms, a term is a product
of a monomial with an integer, and a monomial is a product
of variables from X. Polynomials represent pseudo-Boolean
functions $f : \{0,1\}^n \to \mathbb{Z}$.

As a simple example consider the full adder from Fig. 1.
The full adder defines a pseudo-Boolean function
$f_{FA} : \{0,1\}^3 \to \mathbb{Z}$ with $f_{FA}(a_0, b_0, c) = a_0 + b_0 + c$. We can
compute a polynomial representation for $f_{FA}$ by starting with
a weighted sum $2c_0 + s_0$ (called the “output signature” in
[10]) of the output variables. Step by step, we replace the
variables in polynomials by the so-called “gate polynomials”.
This replacement is performed in reverse topological order of
the circuit, see Fig. 1. We start by replacing $c_0$ in $2c_0 + s_0$
by its gate polynomial $h_2 + h_3 - h_2 h_3$ (which is derived
from the Boolean function $c_0 = h_2 \lor h_3$). Finally, we arrive
at the polynomial $a_0 + b_0 + c$ (called the “input signature”
in [10]) representing the pseudo-Boolean function defined by
the circuit. During this procedure (which is called backward
rewriting) the polynomials are simplified by reducing powers
$v^k$ of variables v with k > 1 to v (since the variables are
binary), by combining terms with identical monomials into
one term, and by omitting terms with leading factor 0. We can
Algorithm 1 Restoring division.

1: for j = 1 to n do
2:     $R^{(j)} := R^{(j-1)} - D \cdot 2^{n-j}$;
3:     if $R^{(j)} < 0$ then
4:         $q_{n-j} := 0$; $R^{(j)} := R^{(j)} + D \cdot 2^{n-j}$;
5:     else
6:         $q_{n-j} := 1$;
7: $R := R^{(n)}$;

also consider $a_0 + b_0 + c = 2c_0 + s_0$ as the “specification” of
the full adder. The circuit implements a full adder iff backward
substitution, now starting with $2c_0 + s_0 - a_0 - b_0 - c$ instead
of $2c_0 + s_0$, reduces the “specification polynomial” to 0 in
the end. (This is the notion usually preferred in SCA-based
verification.)

The correctness of the method relies on the fact that polynomials
(with the above mentioned simplifications and normalizations)
are canonical representations of pseudo-Boolean
functions (up to reordering of the terms). (This is formulated
as Lemma 1 in [35] and proven in [24], e.g.)
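
The full adder example can be replayed with a few lines of Python. This is a minimal sketch and not the implementation of [24] or [35]: polynomials are stored as dictionaries from monomials (sets of variable names) to integer coefficients, and the gate structure of the full adder other than $c_0 = h_2 \lor h_3$ (signals `h1`, `h2`, `h3` and their gates) is an assumption made for the illustration.

```python
def mono(*names):
    """A monomial is the set of its variables (v*v = v for binary v)."""
    return frozenset(names)

def add_term(poly, m, coeff):
    """Add coeff*m to poly in place, dropping terms whose coefficient becomes 0."""
    total = poly.get(m, 0) + coeff
    if total:
        poly[m] = total
    else:
        poly.pop(m, None)

def substitute(poly, var, gate_poly):
    """Replace the gate output variable var in poly by its gate polynomial."""
    result = {}
    for m, coeff in poly.items():
        if var in m:
            rest = m - {var}
            for gm, gcoeff in gate_poly.items():
                add_term(result, rest | gm, coeff * gcoeff)
        else:
            add_term(result, m, coeff)
    return result

def xor_poly(x, y):
    return {mono(x): 1, mono(y): 1, mono(x, y): -2}

def and_poly(x, y):
    return {mono(x, y): 1}

def or_poly(x, y):
    return {mono(x): 1, mono(y): 1, mono(x, y): -1}

# Assumed gate-level full adder, listed in reverse topological order.
gates_reverse_topological = [
    ("c0", or_poly("h2", "h3")),
    ("s0", xor_poly("h1", "c")),
    ("h3", and_poly("h1", "c")),
    ("h2", and_poly("a0", "b0")),
    ("h1", xor_poly("a0", "b0")),
]

def backward_rewrite(poly):
    for var, gate_poly in gates_reverse_topological:
        poly = substitute(poly, var, gate_poly)
    return poly

output_signature = {mono("c0"): 2, mono("s0"): 1}
specification = {**output_signature,
                 mono("a0"): -1, mono("b0"): -1, mono("c"): -1}

input_signature = backward_rewrite(output_signature)   # a0 + b0 + c
remainder = backward_rewrite(specification)            # empty dict = 0-polynomial
print(input_signature, remainder)
```

Representing a monomial as a set makes the reduction of powers $v^k$ to v automatic, and dropping zero coefficients keeps the representation canonical, so comparing the result with the empty dictionary is a valid test for the 0-polynomial.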


B. Divider Circuits 


In the following we briefly review textbook knowledge
on dividers. For more details, see [37], e.g. We
use $\langle a_n, \ldots, a_0 \rangle := \sum_{i=0}^{n} a_i 2^i$ and
$[a_n, \ldots, a_0]_2 := (\sum_{i=0}^{n-1} a_i 2^i) - a_n 2^n$ for interpretations of bit vectors
$(a_n, \ldots, a_0) \in \{0,1\}^{n+1}$ as unsigned binary numbers and
two’s complement numbers, respectively. The leading bit $a_n$
is called the sign bit. An unsigned integer divider is a circuit
with the following property:


Definition 1. Let $(r^{(0)}_{2n-2} \ldots r^{(0)}_0)$ be the dividend with sign
bit $r^{(0)}_{2n-2} = 0$ and value
$R^{(0)} := \langle r^{(0)}_{2n-2} \ldots r^{(0)}_0 \rangle = [r^{(0)}_{2n-2} \ldots r^{(0)}_0]_2$,
$(d_{n-1} \ldots d_0)$ be the divisor with sign bit
$d_{n-1} = 0$ and value $D := \langle d_{n-1} \ldots d_0 \rangle = [d_{n-1} \ldots d_0]_2$,
and let $0 \le R^{(0)} < D \cdot 2^{n-1}$. Then $(q_{n-1} \ldots q_0)$ with
value $Q = \langle q_{n-1} \ldots q_0 \rangle$ is the quotient of the division and
$(r_{n-1} \ldots r_0)$ with value $R = [r_{n-1} \ldots r_0]_2$ is the remainder
of the division, if $R^{(0)} = Q \cdot D + R$ (verification condition 1
= “vc1”) and $0 \le R < D$ (verification condition 2 = “vc2”).


Note that we consider here the case that the dividend has 
twice as many bits as the divisor (without counting sign bits). 
This is similar to multipliers where the number of product 
bits is two times the number of bits of one factor. If both the 
dividend and the divisor are supposed to have the same lengths,
we just set $r^{(0)}_{2n-2} = \ldots = r^{(0)}_{n-1} = 0$ and require D > 0. Then
D > 0 immediately implies $0 \le R^{(0)} < D \cdot 2^{n-1}$.

The simplest algorithm to compute quotient and remainder
is restoring division which is the “school method” to compute
quotient bits and “partial remainders” $R^{(j)}$. Restoring division
is shown in Alg. 1. In each step it subtracts a shifted version
of D. If the result is less than 0, the corresponding quotient
bit is 0 and the shifted version of D is “added back”, i.e.,
“restored”. Otherwise the quotient bit is 1 and the algorithm
proceeds with the next smaller shifted version of D.

Non-restoring division optimizes restoring division by combining
two steps of restoring division in case of a negative
partial remainder: adding the shifted D back and (tentatively)
subtracting the next D shifted by one position less. These two
steps are replaced by just adding D shifted by one position
less (which obviously leads to the same result). More precisely,
non-restoring division works according to Alg. 2.

Algorithm 2 Non-restoring division.

1: $R^{(1)} := R^{(0)} - D \cdot 2^{n-1}$;
2: if $R^{(1)} < 0$ then $q_{n-1} := 0$ else $q_{n-1} := 1$;
3: for j = 2 to n do
4:     if $R^{(j-1)} \ge 0$ then
5:         $R^{(j)} := R^{(j-1)} - D \cdot 2^{n-j}$;
6:     else
7:         $R^{(j)} := R^{(j-1)} + D \cdot 2^{n-j}$;
8:     if $R^{(j)} < 0$ then $q_{n-j} := 0$ else $q_{n-j} := 1$;
9: $R := R^{(n)} + (1 - q_0) \cdot D$;

Fig. 2. Non-restoring divider. [Figure: stages 1 to n+1 with the expected polynomials at the cuts, from $(\sum_{i=0}^{n-1} q_i 2^i) \cdot D + R - R^{(0)}$ at the outputs to $2^{n-1} \cdot D + R^{(1)} - R^{(0)}$ at cut 1.]
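
Alg. 1 and Alg. 2 translate directly into executable form. The following Python sketch works on unbounded integers instead of bit vectors (an assumption made for brevity; the function names are illustrative) and checks both verification conditions of Definition 1 for all allowed inputs of small bit widths.

```python
def restoring_division(r0, d, n):
    """Alg. 1: returns (Q, R) for dividend r0, divisor d and n quotient bits."""
    r, q = r0, 0
    for j in range(1, n + 1):
        r -= d << (n - j)              # subtract the shifted divisor
        if r < 0:
            r += d << (n - j)          # restore; quotient bit q_{n-j} stays 0
        else:
            q |= 1 << (n - j)          # quotient bit q_{n-j} := 1
    return q, r

def non_restoring_division(r0, d, n):
    """Alg. 2: returns (Q, R) for dividend r0, divisor d and n quotient bits."""
    r = r0 - (d << (n - 1))
    q = 0 if r < 0 else 1 << (n - 1)
    for j in range(2, n + 1):
        if r >= 0:
            r -= d << (n - j)
        else:
            r += d << (n - j)
        if r >= 0:
            q |= 1 << (n - j)
    q0 = q & 1
    return q, r + (1 - q0) * d         # final correction step

def check_all(n):
    """Check vc1 and vc2 for every input with 0 <= R0 < D * 2**(n-1)."""
    for d in range(1, 2 ** (n - 1)):           # sign bit d_{n-1} = 0, D > 0
        for r0 in range(d << (n - 1)):
            for divide in (restoring_division, non_restoring_division):
                q, r = divide(r0, d, n)
                if r0 != q * d + r or not 0 <= r < d:
                    return False
    return True

print(check_all(4), check_all(6))
```

Such exhaustive simulation is only feasible for very small n, which is exactly the limitation that motivates the symbolic approach of this paper.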

SRT dividers are most closely related to non-restoring 
dividers, with the main differences of computing quotient bits 
by look-up tables (based on a constant number of partial 
remainder bits) and of using redundant number representa- 
tions which allow to use constant-time adders. Other divider 
architectures like Newton and Goldschmidt dividers rely on 
iterative approximation. In this paper we restrict our attention 
to restoring and non-restoring dividers. 

For dividers it is natural to start backward rewriting not
with polynomials for the binary representations of the output
words (which is basically done for multiplier verification), but
with a polynomial for $Q \cdot D + R$. For a correct divider one
would expect to obtain a polynomial for $R^{(0)}$ after backward
rewriting. As an alternative one could also start with
$Q \cdot D + R - R^{(0)}$ and one would expect that for a correct divider the
result after backward rewriting is 0. This would be a proof for
verification condition (vc1). (Then it remains to show that
$0 \le R < D$ (vc2) which we postpone until later.) This idea was
already proposed by Hamaguchi in 1995 [23] in the context of
verification using *BMDs [21]. As already mentioned in the
introduction, Hamaguchi et al. observed exponential blow-ups
of *BMDs in the backward construction and thus the approach
did not provide an effective way for verifying large integer
dividers.

Fig. 3. Optimized non-restoring divider, n = 4.

However, this basic approach seems to be promising at
first sight. As an example, Fig. 2 shows a high level view
of a circuit for non-restoring division. Stage 1 implements a
subtractor, stages j with $j \in \{2, \ldots, n\}$ implement conditional
adders / subtractors depending on the value of $q_{n-j+1}$, and
stage n+1 implements an adder. If we start backward rewriting
with the polynomial $Q \cdot D + R - R^{(0)}$ (which is quadratic in n)
and if backward rewriting processes the gates in the circuit in
a way that the stages shown in Fig. 2 are processed one after
the other, then we would expect the following polynomials on
the corresponding cuts (see also Fig. 2):

We would expect $(\sum_{i=1}^{n-1} q_i 2^i + 2^0) \cdot D + R^{(n)} - R^{(0)}$ for the
polynomial at cut n which is obtained after processing stage
n+1, since stage n+1 enforces $R = R^{(n)} + (1 - q_0) \cdot D$. For
j = n to 2 we would (by induction) expect
$(\sum_{i=n-j+2}^{n-1} q_i 2^i + 2^{n-j+1}) \cdot D + R^{(j-1)} - R^{(0)}$
for the polynomial at cut j-1 after
processing stage j, since stage j enforces
$R^{(j)} = R^{(j-1)} - q_{n-j+1} (D \cdot 2^{n-j}) + (1 - q_{n-j+1}) (D \cdot 2^{n-j}) = R^{(j-1)} + (1 - 2 q_{n-j+1}) (D \cdot 2^{n-j})$.
Finally, the polynomial at cut 0 after
processing stage 1 using the equation $R^{(1)} = R^{(0)} - D \cdot 2^{n-1}$
would reduce to 0.
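
The expected cut polynomials can be checked numerically: evaluated on the signal values of any run of Alg. 2, each of them must be 0. The following Python sketch does this for the high-level view of Fig. 2 (integers instead of gate-level signals; the function name `cut_values` is an assumption of this illustration).

```python
def cut_values(r0, d, n):
    """Evaluate the expected polynomial at every cut j = 1..n for one input."""
    r = [None] * (n + 1)               # r[j] holds the partial remainder R^(j)
    q = [0] * n                        # q[i] holds quotient bit q_i
    r[0] = r0
    r[1] = r0 - d * 2 ** (n - 1)
    q[n - 1] = 0 if r[1] < 0 else 1
    for j in range(2, n + 1):
        step = d * 2 ** (n - j)
        r[j] = r[j - 1] - step if r[j - 1] >= 0 else r[j - 1] + step
        q[n - j] = 0 if r[j] < 0 else 1
    values = {}
    for j in range(1, n + 1):          # cut j lies between stage j and stage j+1
        weight = sum(q[i] * 2 ** i for i in range(n - j + 1, n)) + 2 ** (n - j)
        values[j] = weight * d + r[j] - r0
    return values

def all_cuts_vanish(n):
    return all(value == 0
               for d in range(1, 2 ** (n - 1))
               for r0 in range(d * 2 ** (n - 1))
               for value in cut_values(r0, d, n).values())

print(cut_values(27, 5, 4), all_cuts_vanish(5))
```

For cut j the code uses the polynomial $(\sum_{i=n-j+1}^{n-1} q_i 2^i + 2^{n-j}) \cdot D + R^{(j)} - R^{(0)}$, which is the formula above with j-1 renamed to j.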


There may be two obvious reasons why backward rewriting 
might fail in practice all the same: (1) It could be the case 
that backward rewriting does not exactly hit the boundaries 
between the stages of the divider. (2) There may be significant 
peaks in polynomial sizes in between the mentioned cuts. 

[24] and [35] show that there are additional obstacles apart 
from those obvious potential problems: In fact, with usual 
optimizations in implementations of non-restoring dividers 
the polynomials represented at the cuts between stages are 
different from this high-level derivation. The reason lies in the 
fact that the stages do not really implement signed addition 
/ subtraction. In general, signed addition / subtraction of 
two (2n — 1)-bit numbers leads to a 2n-bit number. The 
leading bit of the result can only be omitted, if “no overflow 
occurs”. The fact that no overflow occurs results from the
input constraint $0 \le R^{(0)} < D \cdot 2^{n-1}$ of the divider and
from the way the results of the different stages are computed
[24]. Usual implementations even go one step further: By
additional arguments using the input constraint and the circuit 
functionality it can be shown that it is not only possible 
to omit overflow bits of the adder / subtractor stages, but 
it is even possible to omit the computation of one further 
most significant bit. For a detailed analysis see [35]. These 
considerations lead to an optimized implementation, shown
in Fig. 3 for n = 4. (For simplicity, we present the
circuit before propagation of constants, which is however performed
in the real implemented circuit.) In summary, it is important
to note that (1) the stages in Fig. 3 cannot be seen as real 
adder / subtractor stages as shown in the high-level view from 
Fig. 2, (2) backward rewriting leads to polynomials at the cuts 
which are different from the ones shown in Fig. 2, and (3) 
unfortunately those polynomials have (provably) exponential 
sizes. 

The conclusion drawn in [35] was that verification of (large) 
dividers using backward rewriting is infeasible, if there is 
no means to make use of “forward information” obtained by 
propagating the input constraint $0 \le R^{(0)} < D \cdot 2^{n-1}$ in
forward direction through the circuit. This idea indeed made 
it possible to verify large non-restoring dividers with bit widths 
up to 512 bits. 


III. ANALYSIS OF EXISTING APPROACH 


In this section we motivate our approach by analyzing 
weaknesses of the method from [35]. The algorithm from [35] 
starts with a gate level netlist and detects atomic blocks [16] 
like full adders and half adders. This results in a circuit with 
non-trivial atomic blocks (full adders, half adders etc.) and 
trivial atomic blocks (original gates not included in non-trivial 
atomic blocks). The method computes a topological order $<_{top}$
on the atomic blocks with heuristics from [15], [16], computes 
satisfiability don’t cares [36] at the inputs of the atomic 
blocks, and performs backward rewriting starting with the 
specification polynomial $Q \cdot D + R - R^{(0)}$ by replacing atomic
blocks in reverse topological order. During backward rewriting 
two optimization methods are used, if they are needed to keep 
polynomial sizes small: The first method uses information 
on equivalent and antivalent signals (which is derived by 
SAT-based information propagation (SBIF) using the input 
constraint and the don’t cares at the inputs of atomic blocks), 
the second method optimizes polynomials modulo don’t cares 
by reducing the problem to Integer Linear Programming (ILP). 


A. Insufficient don’t care conditions 


Let us start by considering stage n + 1 of the non-restoring
divider (see Figs. 2 and 3). Analyzing the method from
[35] applied to optimized n-bit non-restoring dividers, we
can observe that it does not make use of don’t cares at
the inputs of atomic blocks corresponding to stage n + 1
(although there exist some don’t cares), but it makes use of
the (only existing) antivalence of $q_0$ and $r^{(n)}_{n-1}$, which is shown
by SAT taking already proven satisfiability don’t cares into
account (as already described above). If we only consider
the circuit of stage n + 1 (i.e., the circuit below the dashed
line in Fig. 3), replace $r^{(n)}_{n-1}$ by $\neg q_0$ (i.e., if we make use
of the mentioned antivalence), and start backward rewriting
with
$(\sum_{i=0}^{n-1} q_i 2^i) \cdot (\sum_{i=0}^{n-1} d_i 2^i) + (\sum_{i=0}^{n-2} r_i 2^i - r_{n-1} 2^{n-1}) - (\sum_{i=0}^{2n-2} r^{(0)}_i 2^i)$,
then we indeed obtain exactly the polynomial
$(\sum_{i=1}^{n-1} q_i 2^i + 2^0) \cdot (\sum_{i=0}^{n-1} d_i 2^i) + (\sum_{i=0}^{n-2} r^{(n)}_i 2^i - (1 - q_0) 2^{n-1}) - (\sum_{i=0}^{2n-2} r^{(0)}_i 2^i)$
which corresponds (with $(1 - q_0) = r^{(n)}_{n-1}$) to
$(\sum_{i=1}^{n-1} q_i 2^i + 2^0) \cdot D + R^{(n)} - R^{(0)}$ as shown in Fig. 2,
cut n. Fig. 4 shows the size of the final polynomial for stage
n + 1 with increasing bit width n, with and without using
the antivalence $r^{(n)}_{n-1} = \neg q_0$. Fig. 4 clearly shows that it is
essential to make use of the mentioned antivalence.


Now we consider another version of the non-restoring
divider which is slightly further optimized. It is clear
that in a correct divider the final remainder is non-negative,
i.e., $r_{n-1} = 0$. Therefore there is actually
no need to compute $r_{n-1}$ and the full adder shown in
gray in Fig. 3 can be omitted. The verification condition
vc1 is then replaced by $R^{(0)} = Q \cdot D + \sum_{i=0}^{n-2} r_i 2^i$.
Whereas in the original circuit making use
of antivalences was essential for keeping the polynomial
sizes small, in stage n + 1 of the further optimized
version there are neither
equivalent nor antivalent signals anymore. The only don’t
cares in the last stage (after constant propagation) are two
value combinations at the inputs of the now leading full adder.
However, making use of those don’t cares does not help in
avoiding an exponential blow up as Fig. 5 shows. Intuitively
it is not really surprising that removing the full adder shown
in gray potentially makes the verification problem harder,
since the partial remainders $R, R^{(n)}, \ldots, R^{(0)}$ in the high-level
analysis of polynomials at cuts (see Fig. 2) represent signed
numbers, but now R does not introduce a sign bit anymore.
Nevertheless, this raises the question whether the derivation
of don’t care conditions may be improved in a way that don’t
care optimization can avoid exponential blow ups like the one
shown in Fig. 5.

Fig. 4. Polynomial sizes, stage n + 1, optimized non-restoring divider. [Plot: polynomial size over bit width (up to 512), with and without using the antivalence.]

Fig. 5. Polynomial sizes, stage n + 1, further optimized non-restoring divider. [Plot: polynomial size over bit width, with and without DC optimization.]
B. Don’t care optimization with backtracking


The method from [35] does not make use of don’t care 
optimizations immediately, but stores a backtrack point after 
backward rewriting was applied to an atomic block which has 
don’t cares at its inputs or has input signals with equivalent / 
antivalent signals. Whenever the polynomial grows too much, 
the method backtracks to a previously stored backtrack point 
and performs an optimization. Alg. 3 shows a simplified 


Algorithm 3 Backward rewriting with backtracking.

Input: Specification polynomial $SP^{init}$, Input constraint IC, Circuit CUV with
    atomic blocks $a_1 <_{top} \dots <_{top} a_m$ in topological order $<_{top}$
Output: 1 iff specification holds for all inputs satisfying IC
 1: $SP_m := SP^{init}$; oldsize := size($SP_m$); i := m; ST := $\emptyset$;
 2: $(dc(a_1), \dots, dc(a_m))$ := Compute_DC(CUV, IC);
 3: while i > 0 do
 4:     $SP_{i-1}$ := Rewrite($SP_i$, $a_i$);
 5:     if size($SP_{i-1}$) > threshold · oldsize and ST $\ne \emptyset$ then
 6:         ($SP_j$, j) := pop(ST);
 7:         i := j; $SP_{i-1} := SP_j$;
 8:         $SP_{i-1}$ := Opt_DC($SP_{i-1}$, $dc(a_i)$);
 9:     else
10:         if $dc(a_i) \ne \emptyset$ then push(ST, ($SP_{i-1}$, i)); oldsize := size($SP_{i-1}$);
11:     i := i − 1;
12: return evaluate($SP_0$);


overview of the approach.* For ease of exposition we omitted 
handling of equivalences / antivalences here. 

As shown in [35], the approach works surprisingly well. It 
tries to restrict don’t care optimizations (which are illustrated 
later on in Example 1, for more details see [35]) to situations 
where they are really needed. Only if the size threshold 
in line 5 is exceeded, backtracking is used and don’t care 
optimization comes into play. A further analysis shows that
the success of the approach in [35] is partly due to the
following reasons: (1) In the non-restoring dividers used as
benchmarks, the number of atomic blocks that have any
satisfiability don’t cares grows only linearly with the bit width.
(2) Only a linear number of backtrackings is needed. (3) On the
other hand, if backtrackings have to be used, don’t care
assignments have an essential effect in keeping the polynomials
small (the size of the polynomials is quadratic in n just like the
specification polynomial we start with).

Let us now consider a very simple example which does not 
have the mentioned characteristics. 


Example 1. Consider a circuit which contains (among others)
2n + 1 atomic blocks $a_0, \dots, a_{2n}$. Those blocks are the last
atomic blocks in the topological order and
$a_{2n} <_{top} \dots <_{top} a_0$. The initial polynomial is
$SP^{init} = 8a + 4b + 2c + i_0$. $a_0$ has inputs $x_1, i_1$,
output $i_0$, defines the function
$i_0 = x_1 \lor i_1 = x_1 + i_1 - x_1 i_1$, and we assume that it has
the satisfiability don’t care $(x_1, i_1) = (0,0)$. Correspondingly,
for $j = 1, \dots, n$, $a_j$ defines $i_j = x_{j+1} i_{j+1}$ with
assumed satisfiability don’t care $(x_{j+1}, i_{j+1}) = (0,0)$, and
for $j = n+1, \dots, 2n$, $a_j$ defines
$i_j = x_{j+1} \lor i_{j+1} = x_{j+1} + i_{j+1} - x_{j+1} i_{j+1}$.
We compute size(p) as the number of terms in the polynomial p and
assume threshold = 1.5 in line 5 of Alg. 3. Then Alg. 3 computes the
following series of polynomials


$SP_m = 8a + 4b + 2c + i_0$
$SP_{m-1} = 8a + 4b + 2c + x_1 + i_1 - x_1 i_1$
$SP_{m-2} = 8a + 4b + 2c + x_1 + x_2 i_2 - x_1 x_2 i_2$
$\vdots$
$SP_{m-n-1} = 8a + 4b + 2c + x_1 + x_2 \cdots x_{n+1} i_{n+1} - x_1 x_2 \cdots x_{n+1} i_{n+1}$
$SP_{m-n-2} = 8a + 4b + 2c + x_1 + x_2 \cdots x_{n+2} + x_2 \cdots x_{n+1} i_{n+2} - x_2 \cdots x_{n+2} i_{n+2} - x_1 \cdots x_{n+2} - x_1 \cdots x_{n+1} i_{n+2} + x_1 \cdots x_{n+2} i_{n+2}$

with sizes 4, 6, ..., 6, 10. $SP_{m-n-2}$ is the first polynomial
exceeding the size limit. For each of the n + 1 preceding
atomic blocks there was a satisfiability don’t care at the
inputs, the size limit was not exceeded, and the corresponding
polynomial has been pushed to the backtracking stack ST.
Now backtracking to $SP_{m-n-1}$ takes place. (Note that it
is easy to see that without backtracking using don’t care
optimization the following n − 1 backward rewriting steps would
quickly lead to a blowup in the polynomial sizes finally
resulting in a polynomial with size $2^{n+2} + 2$.) $SP_{m-n-1}$ is
optimized with the don’t care $(x_{n+1}, i_{n+1}) = (0,0)$. Let us
explain the idea of don’t care optimization using this example:
Don’t care optimization adds $v \cdot (1 - x_{n+1}) \cdot (1 - i_{n+1})$
for the don’t care $(x_{n+1}, i_{n+1}) = (0,0)$ to $SP_{m-n-1}$ with a
fresh integer variable v. For all valuations
$(x_{n+1}, i_{n+1}) \ne (0,0)$, $v \cdot (1 - x_{n+1}) \cdot (1 - i_{n+1})$
evaluates to 0, thus we may choose an arbitrary integer value for v
without changing the polynomial “inside the care space”. The choice
of v is made such that the size of $SP_{m-n-1}$ is minimized. So the
task is to choose v such that the size of
$8a + 4b + 2c + x_1 + x_2 \cdots x_{n+1} i_{n+1} - x_1 x_2 \cdots x_{n+1} i_{n+1} + v - v x_{n+1} - v i_{n+1} + v x_{n+1} i_{n+1}$
is minimal. We achieve this by using an ILP solver to get a solution
for v which maximizes the number of terms with coefficients
0 and therefore minimizes the polynomial. It is easy to see
that the best choice is v = 0 in this case. This means that we
arrive at an unchanged polynomial $SP_{m-n-1}$ and the don’t
care did not help. Then we do the replacement of $a_{n+1}$ again,
detect an exceeded size limit again, backtrack to $SP_{m-n}$ and
so on. Exactly as for $SP_{m-n-1}$, don’t care assignment does
not help for $SP_{m-n}, \dots, SP_{m-2}$. The first really interesting
case occurs when backtracking arrives at $SP_{m-1}$. Adding
$v \cdot (1 - x_1) \cdot (1 - i_1)$ with a fresh variable v to
$SP_{m-1}$ results in
$8a + 4b + 2c + v + (1 - v) x_1 + (1 - v) i_1 + (v - 1) x_1 i_1$ and
choosing v = 1 leads to the minimal polynomial $8a + 4b + 2c + 1$
which is even independent from $i_1$. Now replacing
$a_1, \dots, a_{2n}$ does not change the polynomial anymore and we
finally arrive at $SP_{m-2n-1} = 8a + 4b + 2c + 1$ (without further
don’t care assignments).

* $SP_0$ in Alg. 3 does not have to be 0 for correct dividers, it is
sufficient that $SP_0$ evaluates to 0 for all inputs in the allowed
input range $0 \le R^{(0)} < D \cdot 2^{n-1}$. This can be checked by
evaluate($SP_0$) in polynomial time [35].


The example shows that the backtracking method works
in principle, but it comes at huge costs: Backtracking
potentially explores all possible combinations of assigning or
not assigning don’t cares for atomic blocks with don’t cares
by storing backtrack points again in line 10 of Alg. 3 after
successful as well as unsuccessful don’t care optimizations. In
the example this leads to $2^{n+1}$ rewritings for atomic blocks
and $2^{n+1} - 1$ unsuccessful don’t care optimizations, before we
finally backtrack to $SP_{m-1}$ where we do the relevant don’t
care optimization.
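The polynomial sizes and the don’t care optimization step of Example 1 can be reproduced with a small sketch. It is not the implementation from [35]: polynomials are stored as dictionaries from monomials to integer coefficients, and the ILP solver is replaced by a brute-force search over a few candidate values of v (an assumption that only makes sense for toy examples). All function names are hypothetical.

```python
from itertools import product

ONE = {frozenset(): 1}  # the constant polynomial 1

def var(name):
    return {frozenset([name]): 1}

def padd(p, q, scale=1):
    """Return p + scale*q; a polynomial maps a monomial (frozenset of names) to an int."""
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + scale * c
        if r[m] == 0:
            del r[m]
    return r

def pmul(p, q):
    """Product of two polynomials over Boolean variables (x*x = x)."""
    r = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        r = padd(r, {m1 | m2: c1 * c2})
    return r

def por(p, q):
    """Polynomial of p OR q, i.e. p + q - p*q."""
    return padd(padd(p, q), pmul(p, q), -1)

def substitute(p, name, q):
    """One backward rewriting step: replace variable `name` in p by polynomial q."""
    r = {}
    for m, c in p.items():
        term = {m - {name}: c}
        r = padd(r, pmul(term, q) if name in m else term)
    return r

def dc_term(dc):
    """Product that is 1 exactly on the don't care valuation dc (dict name -> 0/1)."""
    t = ONE
    for name, value in dc.items():
        t = pmul(t, var(name) if value else padd(ONE, var(name), -1))
    return t

def opt_dc(p, dc, candidates=range(-4, 5)):
    """Brute-force stand-in for the ILP: pick v minimising the number of terms."""
    term = dc_term(dc)
    v = min(candidates, key=lambda c: (len(padd(p, term, c)), abs(c)))
    return v, padd(p, term, v)

def base():
    return padd(padd(padd({}, var("a"), 8), var("b"), 4), var("c"), 2)

def example1_sizes(n):
    """Sizes of SP_m, SP_{m-1}, ... when a_0, ..., a_{2n} are rewritten without don't cares."""
    sp = padd(base(), var("i0"))
    sizes = [len(sp)]
    for j in range(2 * n + 1):
        x, i = var(f"x{j + 1}"), var(f"i{j + 1}")
        gate = pmul(x, i) if 1 <= j <= n else por(x, i)  # AND for a_1..a_n, OR otherwise
        sp = substitute(sp, f"i{j}", gate)
        sizes.append(len(sp))
    return sizes

# SP_{m-1} = 8a + 4b + 2c + x1 + i1 - x1*i1 with don't care (x1, i1) = (0, 0)
sp_m1 = substitute(padd(base(), var("i0")), "i0", por(var("x1"), var("i1")))
v, optimized = opt_dc(sp_m1, {"x1": 0, "i1": 0})
print(example1_sizes(3))   # [4, 6, 6, 6, 6, 10, 18, 34]
print(v, len(optimized))   # 1 4
```

For n = 3 the last size is 34, which matches $2^{n+2} + 2$, and the optimization at $SP_{m-1}$ selects v = 1 and returns a polynomial with the four terms of 8a + 4b + 2c + 1.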


Our goal is to come up with a don’t care optimization 


Algorithm 4 Computation of satisfiability don’t cares.

Input: Input constraint IC, Circuit CUV with EABs $ea_1 <_{top} \dots <_{top} ea_l$ in
    topological order $<_{top}$, $dc\_cand(ea_j)$ $\forall j \in \{1, \dots, l\}$
Output: Satisfiability don’t cares at inputs of EABs resulting from IC
1: $I = \{j \in \{1, \dots, l\} \mid dc\_cand(ea_j) \ne \emptyset\}$; $i_{old} = 1$; $\chi = IC$;
2: $dc(ea_1) = \emptyset$; ...; $dc(ea_l) = \emptyset$;
3: while $I \ne \emptyset$ do
4:     $i = \min(I)$; slice = $\{ea_{i_{old}}, \dots, ea_{i-1}\}$;
5:     $\chi$ = compute_image($\chi$, slice);
6:     for $(\epsilon_1, \dots, \epsilon_n) \in dc\_cand(ea_i)$ do    ▷ $x_1, \dots, x_n$: input signals of $ea_i$
7:         if $\chi|_{x_1 = \epsilon_1, \dots, x_n = \epsilon_n} = 0$ then $dc(ea_i) = dc(ea_i) \cup \{(\epsilon_1, \dots, \epsilon_n)\}$;
8:     $I = I \setminus \{i\}$; $i_{old} = i$;
9: return $(dc(ea_1), \dots, dc(ea_l))$;


method which is robust against situations like the one illus- 
trated in Example 1 where we have many blocks with don’t 
cares, but only a few of those don’t cares are really useful 
for minimizing the sizes of polynomials. As we will show in 
Sect. V, we run into such situations when we verify restoring 
dividers using the method from [35]. 


IV. DON’T CARE COMPUTATION AND OPTIMIZATION 
A. Don’t care computation for extended atomic blocks 


This section is motivated by [8], [11] which combine several 
gates and atomic blocks into fanout-free cones, compute 
polynomials for the fanout-free cones first and use those 
precomputed polynomials for “macro-gates” formed by the 
fanout-free cones during backward rewriting. Whereas in [8], 
[11] the purpose of forming those fanout-free cones is avoiding 
peaks in polynomial sizes during backward rewriting without 
don’t care optimization, the motivation here is different: Here 
we aim at detecting more and better don’t cares. 

First of all, we detect atomic blocks for fixed known 
functions like full adders and half adders as already mentioned 
in Sect. II. The result is a circuit with non-trivial atomic 
blocks and the remaining gates. Now we want to combine 
those atomic blocks and remaining gates into “extended atomic 
blocks (EABs)” which are fanout-free cones of atomic blocks 
and remaining gates. To do so, we compute a directed graph 
G = (V,E) where the nodes correspond to the non-trivial 
atomic blocks, the remaining gates, and the outputs. There is 
an edge from a node v to a node w iff there is an output of the 
atomic block / gate corresponding to v which is connected to 
an input of the atomic block / gate / output node corresponding 
to w. We compute the coarsest partition $\{P_1, \dots, P_l\}$ of V
such that for all sets $P_i$ and all $v \in P_i$ with more than one
successor it holds that all successors of v are not in $P_i$. We
combine all gates / atomic blocks in $P_i$ into an EAB $ea_i$.
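One way to read this construction is as the computation of fanout-free cones: every node with exactly one successor is merged into the block of that successor, while nodes with several successors stay the root of their own block. The following sketch implements this reading with a union-find structure; the function name, the graph encoding and the example node names are assumptions made for illustration only.

```python
def extended_atomic_blocks(successors):
    """Group the nodes of a DAG into fanout-free cones.

    successors: dict mapping each node to the list of nodes it drives.
    Returns the blocks as a sorted list of sorted node lists.
    """
    nodes = set(successors) | {w for ws in successors.values() for w in ws}
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for v, ws in successors.items():
        distinct = set(ws)
        if len(distinct) == 1:
            # v has a single successor, so it joins that successor's cone.
            parent[find(v)] = find(next(iter(distinct)))

    blocks = {}
    for v in nodes:
        blocks.setdefault(find(v), set()).add(v)
    return sorted(sorted(block) for block in blocks.values())


# Hypothetical netlist: fa1 drives fa2 and g1, g1 drives fa2, fa2 drives the output.
graph = {"fa1": ["fa2", "g1"], "g1": ["fa2"], "fa2": ["out"], "out": []}
print(extended_atomic_blocks(graph))  # [['fa1'], ['fa2', 'g1', 'out']]
```

In the example, fa1 has two successors and therefore forms a block of its own, whereas g1, fa2 and the output node end up in one cone.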

The computation of satisfiability don’t cares at the inputs
of EABs that result from the input constraint IC (for dividers
according to Def. 1, $IC = 0 \le R^{(0)} < D \cdot 2^{n-1}$) is performed
for EABs as described in [35] for atomic blocks. First of
all, an intensive simulation (taking IC into account) excludes
candidates for satisfiability don’t cares. Value combinations at
inputs of EABs that are seen in the simulation are excluded,
finally resulting in a set $dc\_cand(ea_j)$ for each EAB $ea_j$.
Satisfiability don’t cares at inputs of EABs are then computed
by a series of BDD-based image computations [38] as shown
in Alg. 4, starting with IC. In the end we have classified all
don’t care candidates to be real don’t cares or not.†
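The notion of a satisfiability don’t care can be illustrated on a toy circuit by exhaustive enumeration. The sketch below is a stand-in for the simulation and the BDD-based image computations of Alg. 4, not a reimplementation of them: it simply visits every primary input assignment that satisfies the input constraint and records which value combinations reach the inputs of a block. The 2-bit constraint r < d and all names are hypothetical.

```python
from itertools import product

def satisfiability_dont_cares(num_inputs, input_constraint, block_inputs, block_width):
    """Return all value combinations that never occur at the inputs of a block.

    input_constraint: predicate on a tuple of primary input bits.
    block_inputs: maps a primary input tuple to the values at the block inputs.
    """
    reachable = set()
    for bits in product((0, 1), repeat=num_inputs):
        if input_constraint(bits):
            reachable.add(tuple(block_inputs(bits)))
    all_combinations = set(product((0, 1), repeat=block_width))
    return sorted(all_combinations - reachable)


# Toy setting: primary inputs (r1, r0, d1, d0) with the assumed constraint r < d.
def constraint(bits):
    r1, r0, d1, d0 = bits
    return 2 * r1 + r0 < 2 * d1 + d0

# A block that only sees the two most significant bits r1 and d1.
def msb_block_inputs(bits):
    r1, _, d1, _ = bits
    return (r1, d1)

print(satisfiability_dont_cares(4, constraint, msb_block_inputs, 2))  # [(1, 0)]
```

The combination (r1, d1) = (1, 0) can never occur under r < d, so it is a satisfiability don’t care of the block. Exhaustive enumeration is exponential in the number of primary inputs, which is why symbolic image computation is used for real dividers.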

If we apply the method to the optimized divider in Fig. 3, 
the EABs below the dashed line are shown by dashed boxes. 
The number of satisfiability don’t cares at the inputs of the 
dashed boxes (after constant propagation!) are shown at the 
right sides of the boxes just above the full adders. For the 
first EAB, e.g., the number of don’t cares is 9, whereas for
the atomic block (full adder) included in the EAB the number
is only 2. At first sight, it is not clear that more don’t cares
really help during don’t care based optimization, but we will 
show in Sect. V that this is definitely the case and that the 
use of extended atomic blocks is essential for a successful 
verification of large dividers. 


B. Delayed Don’t Care Optimization 


In this section we introduce Delayed Don’t Care Opti- 
mization (DDCO). DDCO is based on the observation that 
don’t care optimization as introduced in [35] is a local 
optimization that does not take its global effects into account. 
If backtracking goes back to a backtrack point with don’t cares, 
then it backtracks to a situation where backward rewriting for 
an (extended) atomic block with don’t cares at its inputs has 
taken place and the inputs of this block have been brought into 
the polynomial. The optimization locally minimizes the size 
of the polynomial using those don’t cares immediately and the 
results of the optimization do not depend on rewriting steps 
which take place in the future. However, it is obvious that the 
future sizes of polynomials depend on the future substitutions 
during backward rewriting and therefore a local don’t care 
optimization may go into the wrong direction. For that reason 
we propose a delayed don’t care optimization taking future 
steps into account, which are performed after rewriting of the 
block for which the don’t cares are defined. Before we
introduce DDCO, we illustrate the effect by an example.


Example 2. Consider the polynomial

$p = x_1 x_4 x_5 x_6 + x_2 x_4 x_5 x_6 + x_3 x_4 x_5 x_6 - x_1 x_2 x_4 x_5 x_6 - x_1 x_3 x_4 x_5 x_6 - x_2 x_3 x_4 x_5 x_6 + x_1 x_2 x_3 x_4 x_5 x_6$

with size 7. Assume that the valuation
$(x_1, x_2, x_3, x_4, x_5) = (0,0,0,1,1)$ is a don’t care. By using
the don’t care optimization method from [35] which was already
illustrated in Example 1, we arrive at a polynomial

$q = p + v x_4 x_5 - v x_1 x_4 x_5 - v x_2 x_4 x_5 - v x_3 x_4 x_5 + v x_1 x_2 x_4 x_5 + v x_1 x_3 x_4 x_5 + v x_2 x_3 x_4 x_5 - v x_1 x_2 x_3 x_4 x_5$

with a new integer variable v. Since there is no pair of terms
in q with the same monomials, v = 0 leads to the polynomial
with the smallest number of terms. For all $v \ne 0$, q has the
size 15 instead of 7. This shows that a local don’t care
optimization with don’t care $(x_1, x_2, x_3, x_4, x_5) = (0,0,0,1,1)$

† It is easy to see that the don’t care computation from Alg. 4 can be
extended to a verification of vc2 (similar to [35]) just by adding a final step
computing the image $\chi$ at the outputs. This way we obtain the image of the
input constraint produced by the whole circuit. Then it has only to be checked
whether $\chi$ implies $0 \le R < D$.


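Algorithm 5 Rewriting with DDCO.

Input: Specification polynomial $SP^{init}$; Input constraint IC; Circuit CUV
    with EABs $ea_1 <_{top} \dots <_{top} ea_m$ in topological order $<_{top}$;
    EABs $ea_i$ with input signals $x_1^{(i)}, \dots, x_{n_i}^{(i)}$ and don’t cares
    $dc(ea_i) = \{(\epsilon_{1,1}^{(i)}, \dots, \epsilon_{1,n_i}^{(i)}), \dots, (\epsilon_{l_i,1}^{(i)}, \dots, \epsilon_{l_i,n_i}^{(i)})\}$; “delay” d
Output: 1 iff specification holds for all inputs satisfying IC
 1: $SP_m := SP^{init}$; i := m + 1;
 2: while i − 1 > 0 do
 3:     i := i − 1;
 4:     $SP_{i-1}$ := Rewrite($SP_i$, $ea_i$);
 5:     for j = 1 to $l_i$ do
 6:         $SP_{i-1} := SP_{i-1} + v_j^{(i)} \cdot \prod_{k:\, \epsilon_{j,k}^{(i)} = 1} x_k^{(i)} \cdot \prod_{k:\, \epsilon_{j,k}^{(i)} = 0} (1 - x_k^{(i)})$;
 7:     if i + d > m then continue;
 8:     $SP^{tmp}$ := assign_dc($SP_{i-1}$, $v_1^{(i+d-1)} = 0, \dots, v_{l_i}^{(i)} = 0$);
 9:     dc0_size := size(assign_dc($SP^{tmp}$, $v_1^{(i+d)} = 0, \dots, v_{l_{i+d}}^{(i+d)} = 0$));
10:     if dc0_size < increase(size($SP_{i+d}$)) then
11:         for j = i − 1 to i + d − 1 do
12:             $SP_j$ := assign_dc($SP_j$, $v_1^{(i+d)} = 0, \dots, v_{l_{i+d}}^{(i+d)} = 0$);
13:     else
14:         $(v_1^{(i+d)} = c_1^{(i+d)}, \dots, v_{l_{i+d}}^{(i+d)} = c_{l_{i+d}}^{(i+d)})$ := DC_opt($SP^{tmp}$);
15:         for j = i − 1 to i + d − 1 do
16:             $SP_j$ := assign_dc($SP_j$, $v_1^{(i+d)} = c_1^{(i+d)}, \dots, v_{l_{i+d}}^{(i+d)} = c_{l_{i+d}}^{(i+d)}$);
17: $SP_0$ := assign_dc($SP_0$, $v_1^{(d)} = 0, \dots, v_{l_1}^{(1)} = 0$);
18: return evaluate($SP_0$);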


does not help in this example. Now assume that we perform a
replacement of $x_6$ by $x_4 \cdot x_5$ in the polynomial q, resulting in

$q' = v x_4 x_5 + (1 - v) x_1 x_4 x_5 + (1 - v) x_2 x_4 x_5 + (1 - v) x_3 x_4 x_5 + (v - 1) x_1 x_2 x_4 x_5 + (v - 1) x_1 x_3 x_4 x_5 + (v - 1) x_2 x_3 x_4 x_5 + (1 - v) x_1 x_2 x_3 x_4 x_5$

Here it is easy to see that choosing v = 1 reduces $q'$ to
$q' = x_4 x_5$. I.e., performing local don’t care optimization
before rewriting with $x_6 = x_4 \cdot x_5$ did not help and leads to a
polynomial with 7 terms after the rewriting step, but don’t care
optimization after the rewriting step reduces the polynomial
to a single term. By generalizing the example from 6 to an
arbitrary number of n variables, we obtain $2^{n-3} - 1$ terms
with don’t care optimization before rewriting and one term
with don’t care optimization after rewriting, which shows that
delayed don’t care optimization can be exponentially better
than local don’t care optimization (even for a delay by one
step only).
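Example 2 and its generalization to n variables can be checked with a short sketch. As before, this is an illustration under stated assumptions and not the authors’ tool: polynomials are dictionaries from monomials to integer coefficients, and the ILP is replaced by a brute-force search over a few values of v. Because the don’t care term does not contain the rewritten variable, adding it before or after the substitution gives the same polynomial, so the delayed variant simply optimizes after rewriting. All names are hypothetical.

```python
from itertools import product

ONE = {frozenset(): 1}

def var(name):
    return {frozenset([name]): 1}

def padd(p, q, scale=1):
    """Return p + scale*q for polynomials stored as {monomial: coefficient}."""
    r = dict(p)
    for m, c in q.items():
        r[m] = r.get(m, 0) + scale * c
        if r[m] == 0:
            del r[m]
    return r

def pmul(p, q):
    """Product over Boolean variables (x*x = x)."""
    r = {}
    for (m1, c1), (m2, c2) in product(p.items(), q.items()):
        r = padd(r, {m1 | m2: c1 * c2})
    return r

def por(p, q):
    return padd(padd(p, q), pmul(p, q), -1)

def substitute(p, name, q):
    r = {}
    for m, c in p.items():
        term = {m - {name}: c}
        r = padd(r, pmul(term, q) if name in m else term)
    return r

def dc_term(dc):
    t = ONE
    for name, value in dc.items():
        t = pmul(t, var(name) if value else padd(ONE, var(name), -1))
    return t

def opt_dc(p, dc, candidates=range(-4, 5)):
    """Brute-force stand-in for the ILP based don't care optimization."""
    term = dc_term(dc)
    v = min(candidates, key=lambda c: (len(padd(p, term, c)), abs(c)))
    return v, padd(p, term, v)

def example2(n=6):
    """Compare local and delayed don't care optimization for n variables."""
    k = n - 3
    xs = [f"x{j}" for j in range(1, n + 1)]
    disj = {}
    for name in xs[:k]:
        disj = por(disj, var(name))          # x1 OR ... OR xk as a polynomial
    a, b, out = xs[k], xs[k + 1], xs[k + 2]
    p = pmul(pmul(pmul(disj, var(a)), var(b)), var(out))
    dc = {name: 0 for name in xs[:k]}
    dc[a] = dc[b] = 1                        # don't care (0, ..., 0, 1, 1)
    gate = pmul(var(a), var(b))              # out = a * b
    v_local, q = opt_dc(p, dc)               # optimize first, rewrite afterwards
    local = substitute(q, out, gate)
    v_delayed, delayed = opt_dc(substitute(p, out, gate), dc)  # rewrite first
    return len(p), v_local, len(local), v_delayed, len(delayed)

print(example2(6))  # (7, 0, 7, 1, 1)
print(example2(8))  # (31, 0, 31, 1, 1)
```

For n = 6 the local optimization keeps all 7 terms, while the delayed optimization selects v = 1 and leaves the single term $x_4 x_5$. For n = 8 the local result has $2^{5} - 1 = 31$ terms.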


Alg. 5 shows an integration of DDCO into backward rewriting.
In contrast to Alg. 3, it does not use backtracking and it
always “delays” don’t care optimization by d EAB rewriting
steps. In the while loop from lines 2 to 16, don’t care terms
with fresh integer variables $v_j^{(i)}$ are immediately added to the
polynomial $SP_{i-1}$ for each don’t care of the current EAB $ea_i$
(line 6), but those don’t cares may only be used with a delay of
d EAB rewritings, i.e., in the iteration replacing $ea_i$ only don’t
cares coming from $ea_{i+d}$ may be used. Therefore, younger
don’t care variables are temporarily assigned to 0 in line 8,
leading to a polynomial $SP^{tmp}$. Now the size of $SP_{i+d}$ (which
is the polynomial before rewriting with $ea_{i+d}$) is compared to
the size dc0_size of $SP_{i-1}$ where the don’t care variables
from $ea_{i+d}$ are assigned to 0 as well (i.e., they are not used).
If dc0_size did not increase too much compared to the size of
$SP_{i+d}$ (“too much” is specified by a monotonically increasing
function increase), then the don’t care variables from $ea_{i+d}$
are permanently assigned to 0 (lines 11 and 12) in the current
as well as all previous polynomials containing those variables.
Otherwise, the known ILP based don’t care optimization is
used and its results are inserted into $SP_{i-1}$ and again also in
all previous polynomials containing the don’t care variables
from $ea_{i+d}$ (lines 14 to 16).


V. EXPERIMENTAL RESULTS 


Our experiments have been carried out on one core of an 
Intel Xeon CPU E5-2643 with 3.3GHz and 62GiB of main 
memory. The run time of all experiments was limited to 24 
CPU hours. All run times in Tables I, II and III are given in 
CPU seconds. We used the ILP solver Gurobi [39] for solving 
the ILP problems for don’t care optimization of polynomials. 
For image computations we used the BDD package CUDD 
3.0.0 [40]. For benchmarks and binaries see [41]. 

In our experiments we consider verification of three different
types of divider benchmarks with different bit widths
(Cols. 1 in Tabs. I to III). Tab. I shows results for non-restoring
dividers “non-restoring1” as seen in Fig. 3 (with the gray
full adder included), which were also used in [35]. Table II
contains results for further optimized non-restoring dividers
“non-restoring2” that omit the gray full adder shown in Fig. 3.
Table III gives results for restoring dividers. All three tables
share the same column labels. Note that we did not make
use of any hierarchy information during verification, but only
used the flat gate-level netlist (numbers of gates are shown in
Cols. 2) and employed heuristics for detecting atomic blocks
as well as for finding good substitution orders [15], [16].

We begin with three experiments for comparison where we
check the equivalence of the divider circuits with a “golden
specification”. In those experiments we restrict counterexamples
to the allowed range $0 \le R^{(0)} < D \cdot 2^{n-1}$ of inputs.

In the first experiment we used a SAT-solver (MiniSat 2.2.0 
[42]) to solve the corresponding satisfiability problems. The 
results from Cols. 3 in Tabs. I, II, and III show that SAT-solving
is hard for non-trivial arithmetic circuits and none
of the benchmarks with bit widths larger than 8 could be 
solved in the specified time limit. In the second experiment 
we considered the combinational equivalence checking (CEC) 
approach of ABC [43], [44]. Since it is based on And-Inverter- 
Graph (AIG) rewriting via structural hashing, simulation, and 
SAT, the equivalence checking between two designs is reduced 
to finding equivalent internal AIG nodes. As for SAT-solving, 
ABC cannot verify the dividers with bit widths larger than 8, 
see Cols. 4 in Tabs. I, II, and III. In a third experiment we 
used a commercial verification tool. As Cols. 5 in Tabs. I, II, 
and III show, the commercial tool is able to verify also 16-bit 
dividers, for the restoring dividers it even verifies the 32-bit 
divider in about 15 CPU hours, but does not finish within the 
time limit for larger dividers. 

From Col. 6 in Tab. I we can see that the method from [35]
performs very well for the verification of the non-restoring1
dividers. Col. 7 (“#bt”) shows how many backtrack operations
were actually performed. For the non-restoring2 benchmarks
considered in Tab. II the method exceeds the available memory
for 16 bits and larger, for the restoring ones from Tab. III even
already for 8 bits. As already shown by our analysis from
Sect. III (see Fig. 5), equivalence/antivalence computation
and don’t care optimizations on atomic blocks as used in
[35] are not strong enough to avoid exponential blowups
of polynomials for the non-restoring2 dividers. For restoring
dividers the situation is similar.

In the next experiment we evaluate our new approach of using
EABs for don’t care computation instead of atomic blocks
as used in [35] (at first without DDCO). For non-restoring1
dividers (where the method from [35] already performed very
well) this approach is somewhat slower than the original
method, see Cols. 6 and 8 of Tab. I. The reason for this is that
using EABs instead of atomic blocks as in [35] leads to more
blocks where don’t cares are applicable whereas the number
of don’t care optimizations which are really necessary stays
the same. This can be seen in Cols. 7 and 9 of Tab. I which
compare the number of performed backtracks. The version
with EABs performs additional backtracks to backtrack points
where optimization does not help and it has to store a larger
number of backtrack points. This even leads to running out of
available memory for the 512-bit instance of non-restoring1.
On the other hand, already the usage of EABs enables the
verification of the non-restoring2 dividers from Table II up to
256 bits in about 2 hours. Since don’t care optimizations
on atomic blocks as used in [35] are not strong enough to
avoid exponential blowups for the non-restoring2 dividers (as
already mentioned above), using EABs is inevitable. However,
the approach is not able to verify restoring dividers with
bit widths larger than 64, see Col. 8 in Table III, due to
increasing run times and memory consumption. This can be
explained by the larger number of EABs with non-empty
don’t care sets for restoring dividers compared to non-restoring
dividers. These numbers are given in Cols. 10 (“#EABs with
DCs”) of Tabs. I and II for the non-restoring dividers and in
Col. 10 of Tab. III for restoring dividers. The numbers grow
only linearly for non-restoring dividers, but quadratically for
restoring dividers. More EABs with non-empty don’t care sets
lead to an increased memory consumption by storing more
backtrack points and to increased run times consumed by
extensive backtracking. The effect occurring here has already
been illustrated in Example 1 of Sect. III-B where we have
to perform an exponential number of unsuccessful backtracks
before finally arriving at the relevant don’t care optimization.
For the 64-bit non-restoring2 divider, e.g., the approach needs
less than 50 seconds with 205 backtracks (Cols. 8, 9 of Tab. II)
whereas the corresponding restoring divider only finishes in
about 15 minutes with 3047 backtracks (Cols. 8, 9 of Tab. III).

Cols. 12 of Tabs. I, II, and III show that those difficulties can
be overcome by using our novel DDCO method. It turned out
that already the simplest possible parameter choice of d = 1
and increase(size) = size + 1 in Alg. 5 is successful. We were
even able to verify the 256-bit restoring divider in less than 9.5
CPU hours and both 512-bit instances of non-restoring1 and
non-restoring2 could be verified in about 7.5 hours. Comparing
TABLE I
VERIFYING DIVIDERS NON-RESTORING1 FROM [35], TIMES IN CPU SECONDS.

Our method = [35]+EABs+DDCO

   n     #Gates     SAT     ABC       Com.       [35]  [35]  [35]+EABs  [35]+EABs     #EABs   #DC  Our method      peak
                   time    time       time       time   #bt       time        #bt  with DCs  opt.        time     poly.
   4        100    0.22    0.01       1.23       0.15     7       0.44         12        12     5        0.23       128
   8        404   68.58   17.65       1.33       0.39    11       1.21         37        28     9        0.94       199
  16      1,588      TO      TO     165.87       1.59    19       3.26         83        60    17        1.87       407
  32      6,260      TO      TO         TO       5.06    35      12.10        166       124    33        6.78     1,207
  64     24,820      TO      TO         TO      21.88    67      96.15        365       252    65       28.24     4,343
 128     98,804      TO      TO         TO     114.73   131   1,434.11        909       508   129      153.71    16,759
 256    394,228      TO      TO         TO     825.11   259  13,656.97      2,077     1,020   257    1,985.05    66,167
 512  1,574,900      TO      TO         TO   9,183.28   515         MO          -     2,044   513   27,370.60   263,287

TABLE II
VERIFYING DIVIDERS NON-RESTORING2, TIMES IN CPU SECONDS.

Our method = [35]+EABs+DDCO

   n     #Gates     SAT     ABC       Com.       [35]  [35]  [35]+EABs  [35]+EABs     #EABs   #DC  Our method      peak
                   time    time       time       time   #bt       time        #bt  with DCs  opt.        time     poly.
   4         96    0.23    0.01       1.21       0.17     8       0.26         17        11     5        0.23        61
   8        400   31.83   16.78       1.86   2,486.89    31       0.99         21        27     9        0.95       117
  16      1,584      TO      TO     108.23         MO     -       2.68         51        59    17        2.17       325
  32      6,256      TO      TO         TO         MO     -       9.36        102       123    33        7.25     1,125
  64     24,816      TO      TO         TO         MO     -      49.41        205       251    65       26.87     4,261
 128     98,800      TO      TO         TO         MO     -     340.85        397       507   129      149.75    16,677
 256    394,224      TO      TO         TO         MO     -   7,341.86      1,053     1,019   257    1,691.72    66,085
 512  1,574,896      TO      TO         TO         MO     -         MO          -     2,043   513   27,351.10   263,205

TABLE III
VERIFYING RESTORING DIVIDERS, TIMES IN CPU SECONDS.

Our method = [35]+EABs+DDCO

   n     #Gates     SAT     ABC       Com.       [35]  [35]  [35]+EABs  [35]+EABs     #EABs   #DC  Our method      peak
                   time    time       time       time   #bt       time        #bt  with DCs  opt.        time     poly.
   4        140    0.27    0.01       1.21       2.59    17       0.47         35        16     8        0.38        61
   8        700   14.88   14.27       1.49         MO     -       1.77         45        64    16        1.42       117
  16      3,068      TO      TO      16.39         MO     -       8.41        171       256    32        6.63       325
  32     12,796      TO      TO  53,277.73         MO     -      65.99        727     1,024    64       29.02     1,125
  64     52,220      TO      TO         TO         MO     -     885.71      3,047     4,096   128      193.40     4,261
 128    210,940      TO      TO         TO         MO     -         MO          -    16,384   256    2,244.24    16,677
 256    847,868      TO      TO         TO         MO     -         MO          -    65,536   512   33,593.30    66,085
 512  3,399,676      TO      TO         TO         MO     -         MO          -   262,144     -          TO         -


the numbers of EABs with non-empty don’t care sets (Col. 10,
“#EABs with DCs”) with the actual numbers of don’t care
optimizations performed (Col. 11, “#DC opt.”) in Tab. III,
we observe that in particular for restoring dividers DDCO
performs don’t care optimizations only for a small fraction
of the EABs with non-empty don’t care sets. The effect is
visible especially for larger instances. For the 256-bit divider,
e.g., this percentage is less than 1%.
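The fraction quoted above, and the linear versus quadratic growth of the number of EABs with don’t cares, can be recomputed directly from the table entries. The sketch below only restates values from Cols. 10 and 11 of Tabs. I and III; the closed forms it checks (n² and 2n for restoring dividers, 4n − 4 and n + 1 for non-restoring1) are observations about these tabulated numbers, not formulas taken from the paper.

```python
# (#EABs with DCs, #DC opt.) per bit width, copied from Cols. 10 and 11.
restoring = {
    4: (16, 8), 8: (64, 16), 16: (256, 32), 32: (1024, 64),
    64: (4096, 128), 128: (16384, 256), 256: (65536, 512),
}
non_restoring1 = {
    4: (12, 5), 8: (28, 9), 16: (60, 17), 32: (124, 33),
    64: (252, 65), 128: (508, 129), 256: (1020, 257), 512: (2044, 513),
}

def optimized_fraction(table, n):
    """Share of EABs with don't cares for which DDCO actually optimizes."""
    eabs, opts = table[n]
    return opts / eabs

print(f"{100 * optimized_fraction(restoring, 256):.2f}%")   # 0.78%

# Growth of the number of EABs with non-empty don't care sets.
print(all(eabs == n * n for n, (eabs, _) in restoring.items()))           # True
print(all(eabs == 4 * n - 4 for n, (eabs, _) in non_restoring1.items()))  # True
```

For the 256-bit restoring divider this gives 512 / 65,536, i.e. about 0.78% of the EABs with don’t cares.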

Finally, Cols. 13 give the peak polynomial sizes during 
backward rewriting, counted in number of monomials. It can 
be observed that these peak sizes grow quadratically with the 
bit width. This shows that our methods are really successful 
in keeping the polynomial sizes small, since already the 
specification polynomial is quadratic in n. 
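The quadratic growth of the peak sizes can be checked against the numbers in Col. 13 of Tab. I: doubling the bit width should roughly quadruple the peak size. The sketch below computes these ratios. The closed form n² + 2n + 119 used in the last check is only an empirical fit to the tabulated values for n ≥ 8 and is not stated in the paper.

```python
# Peak polynomial sizes (number of monomials) from Col. 13 of Tab. I.
peak = {4: 128, 8: 199, 16: 407, 32: 1207, 64: 4343,
        128: 16759, 256: 66167, 512: 263287}

widths = sorted(peak)
# Growth factor when the bit width doubles; a quadratic function approaches 4.
ratios = [peak[b] / peak[a] for a, b in zip(widths, widths[1:])]
for n, ratio in zip(widths[1:], ratios):
    print(f"n = {n:3d}: peak size grows by a factor of {ratio:.2f}")

# Empirical fit to the table entries (an observation, not a claim of the paper).
fits = all(peak[n] == n * n + 2 * n + 119 for n in widths if n >= 8)
print(fits)  # True
```

The ratios increase towards 4, reaching about 3.98 for the step from 256 to 512 bits.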

In summary, the presented results show that our new method 
is able to successfully verify not only the divider benchmarks 
from [35], but also new divider architectures for which the 
previous approach fails. 


VI. CONCLUSIONS AND FUTURE WORK 


We analyzed weaknesses of previous approaches that enhanced
backward rewriting in an SCA approach with forward
information propagation and we presented two major contribu- 


tions to overcome those weaknesses. The first contribution is 
the usage of Extended Atomic Blocks to enable stronger don’t 
care computations. The second one is the new method of De- 
layed Don’t Care Optimization which has two benefits: First, 
it performs don’t care optimizations in a more global rewriting 
context instead of seeking only local optimizations of
polynomials, and second it is able to effectively minimize the 
number of don’t care optimizations compared to considering 
all possible combinations of using / not using don’t cares of 
EABs which can potentially occur in a backtracking approach. 
We showed that our new method is able to verify large divider 
designs as well as different divider architectures. For the 
future, we believe that the general approach of combining 
backward rewriting with forward information propagation will 
be a key concept to verify further divider architectures as well 
as other arithmetic circuits at the gate level. 
Abstract—Every computer having a network, USB or disk
controller has a Direct Memory Access Controller (DMAC) which 
is configured by a driver to transfer data between the device and 
main memory. The DMAC, if wrongly configured, can therefore 
potentially leak sensitive data and overwrite critical memory to
take over the system. Since DMAC drivers tend to be buggy (due
to their complexity), these attacks are a serious threat. 

This paper presents a general formal framework for modeling 
DMACs and verifying under which conditions they are isolated. 
These conditions can be used as a specification for guaranteeing 
that a driver configures the DMAC correctly. The framework 
provides general isolation theorems that are common to all 
DMACs, leaving to the user only the task of verifying proof 
obligations that are DMAC specific. This provides a reusable 
verification infrastructure that reduces the verification effort of 
DMACs. Models and proofs have been developed in the HOL4 
interactive theorem prover. To demonstrate the usefulness of the 
framework, we instantiate it with a DMAC of a USB. 

Index Terms—formal verification, interactive theorem proving, 
DMA, I/O security, memory isolation 


I. INTRODUCTION 


Direct memory access controllers (DMACs) are hardware 
components transferring data between memory and I/O de- 
vices (e.g. memory-to-memory copies, and data transfers to 
and from network interface cards, USB, disks, and graphics 
accelerators). Without a DMAC, the CPU must perform these 
data transfers, spending time on data transfers rather than 
on applications, decreasing performance significantly [1]-[3], 
[44]. DMACs can also reduce power consumption since a CPU 
is more power demanding than a DMAC [4], [5], [44]. 

Since DMACs can access memory, where critical data and 
code are located, they can be used by attackers to take over or
crash the system. Examples include abusing a GPU DMAC to 
gain privilege escalation [9] and a network interface DMAC to 
crash Linux [10]. To prevent DMAC attacks, many formally 
verified high-security hypervisors and operating systems [23]- 
[30] either disable DMACs or rely on IOMMUs (mem- 
ory management units [15]-[17] placed between the DMAC 
and memory). The use of IOMMUs has three significant
disadvantages: not all hardware platforms have IOMMUs;
it negatively impacts performance and further reduces time
predictability (due to additional translation table walks [18],
[19]); and it requires additional non-trivial (potentially buggy
[20]-[22]) software for configuring and protecting page tables
and associated data structures.

Verifying memory safety in the presence of DMACs and the
absence of IOMMUs requires formal models of the DMAC
hardware including the interface between DMAC, software
and memory. Such models allow reasoning about the effects
of software accessing DMAC registers, of DMAC memory
accesses, and the interaction between software and DMAC,
which share data structures in memory.

We present a general framework for modeling DMACs 
(Section III). The framework is implemented in the HOL4 
interactive theorem prover [31] and includes a general DMAC 
model which can be instantiated to a given DMAC by defining 
14 DMAC specific functions (the most significant ones are 
listed in Table II). This generalization allows us to identify 
and verify sufficient conditions to confine DMAC memory 
accesses to certain memory regions. 

To achieve this general verification result, in Section IV
we establish a refinement between the general DMAC model and
an abstract DMAC model, which is easier to analyze, and identify
sufficient conditions to preserve the refinement that must be
satisfied by the DMAC instantiation and the DMAC driver. This
strategy has three main benefits: (1) the refinement theorem can
be reused to verify functional correctness of drivers using the
abstract model; (2) the verification of the instantiation deals
only with the identified sufficient conditions and does not have
to deal with the entire transition system of the DMAC model; and
(3) the software conditions can be verified using the abstract model.

In order for the framework to be as general as possible, we 
have reviewed numerous DMACs (Table I). In Section V we 
demonstrate our approach by instantiating the framework with 
the USB DMAC in an SoC from Texas Instruments [32]. We 
use our result to identify the conditions that must be satisfied 
by a driver or a security monitor. The use of the framework 
has largely reduced the time for analyzing the USB DMAC. 

Finally, in Section VI we discuss the HOL4 implementation 
and the security analysis of the Linux USB DMAC driver. 


II. BACKGROUND 


Fig. 1. DMAC

DMACs perform memory accesses by operating on a queue
of buffer descriptors (BDs), illustrated in Fig. 1, which are
initialized by the driver. Each BD contains information about
a memory transfer and the status of that transfer. The queue
can be stored either in internal DMAC memory or in external 
main memory, either as a linked list (potentially cyclic), a ring, 
or as an array. Once the driver has initialized the BDs, the 
driver signals the DMAC to start operating on the BDs in the 
queue, which is done by a pipeline consisting of four stages. 
(1) Fetch: The DMAC fetches the BD into internal CPU- 
inaccessible memory. (2) Update: If the BD is operated on 
in multiple rounds, then the DMAC updates the BD to reflect 
the remaining transfers to perform for subsequent rounds. (3) 
Process: The DMAC performs the direct memory accesses 
(DMA transfers) to the buffers in main memory as specified 
by the BD. (4) Write back: If all memory accesses specified 
by the BD have been performed, the DMAC writes back the 
BD to signal the driver that the BD has been processed and 
can be reused for new transfers. 

In the following we use O = {f,u,p,w} to refer to 
these four operations. The DMAC may also perform memory 
accesses due to maintenance operations, for example to store 
statistics or management data in memory. These operations 
are not atomic and may require multiple memory accesses. 
Furthermore, DMACs may be able to work on multiple queues 
of BDs concurrently, where each queue constitutes one DMA 
channel, and each channel may have more than one BD in 
each of its pipeline stages. 

Both the driver and the DMAC can read and modify the 
queues: The driver reads the status of existing BDs and 
appends new BDs; the DMAC reads and updates BDs. For 
this reason, verifying properties of this kind of system is 
challenging and similar to verifying concurrent threads sharing 
memory. In order to control the complexity caused by the 
interleaving of the CPU/driver and the DMAC, the 
verification must exploit some sort of rely/guarantee reasoning 
[6], which enables verification of each component in isolation 
while assuming properties of the other component. Our 
verification approach follows this strategy, showing that there 
are sufficient conditions (rely) that, if met by the driver, 
restrict (guarantee) the memory accesses of the DMAC. 


A. DMAC Characteristics 


In order to support a wide range of DMACs, our general 
model must accurately describe the memory accesses that may 
be performed by an arbitrary DMAC. To identify the common 
features of DMACs, we studied eight stand-alone DMACs, six 
embedded in USB controllers, five embedded in Ethernet 
controllers, and the DMAC of IBM Cell, some characteristics 
of which are listed in Table I. The main difference among the 
DMACs is the mechanism used to organize BD queues: 13 
DMACs use linked lists; five use ring buffers; and two use 
queues of one single BD. Moreover, seven DMACs store the 
queues in internal memory and 13 store the queues in main 
memory. Furthermore, DMACs differ in: internal states 
(e.g., address pointers, counters, and state machines); number 
of DMA channels; reactions to register accesses made by the 
CPU; scheduling of channels; BD format (e.g., fields for buffer 
start address and size); and behavior of the four pipeline stages 
(fetch, update, process, and write back). 

TABLE I 
STUDIED DMACs. 

Stand-alone DMACs 
Chip | BD Organization | BD Location 
Texas Instruments AM335x | Linked list | Internal memory 
Microchip PIC32 Family | Linked list | Internal memory 
Xilinx AXI DMA v7.1 | Linked list | Main memory 
NXP MPC5675/KMPC57xx | Linked list | Internal memory 
Infineon GPDMA | Linked list | Main memory 
Broadcom BCM2835 | Linked list | Main memory 
ST Microelectronics STR91xFA | Linked list | Main memory 
Texas Instruments TMS320C5515 | Linked list | Main memory 
IBM Cell BE | Array/Ring | Main memory 

USB DMACs 
Chip | BD Organization | BD Location 
Cypress EZ-USB FX3 | Linked list | Main memory 
Xilinx Zynq-7000 | Linked list | Main memory 
Texas Instruments AM335x | Linked list | Main memory 
NXP SAF1761 USB OTG | One BD | Internal memory 
STM32F72xxx/STM32F73xxx | One BD per channel | Internal memory 
Microchip PIC32 Family | Ring | Main memory 

NIC DMACs 
Chip/Board | BD Organization | BD Location 
Texas Instruments AM335x | Linked list | Internal memory 
Broadcom NetXtreme/Netlink | Ring | Main memory 
Realtek Ethernet RTL8100 | Ring | Internal memory 
3Com 3C90x/B | Linked list | Main memory 
Intel e1000/e, X550, I350, I210 | Ring | Main memory 


B. Security Threat from DMACs 


Without an IOMMU, a DMAC can access memory without 
restrictions. For instance, consider a microkernel (or a hy- 
pervisor), where a user-mode driver (or a guest) should not 
be able to directly access kernel memory. If the driver can 
directly configure a DMAC that can perform memory-memory 
transfers, then the driver could store a malicious program in its 
own memory, and configure the DMAC to transfer this buffer 
to the exception handling table of the kernel. This results 
in code injection, bypassing the normal protection provided 
by the MMU that prevents direct tampering from the driver. 
Similarly, the driver of an Ethernet controller may overwrite 
kernel data structures with an incoming network packet or 
leak data from kernel memory. 

In order to isolate a DMAC, its configuration must meet 
three sufficient conditions, which are all violated by the 
example of Fig. 2: 


1) BDs specify DMA reads and writes to buffers that are 
considered “readable” and “writable”: BD1 can instruct 
the DMAC to violate isolation since part of the buffer 
is outside the allowed memory region. 

2) If BDs are stored in main memory, then the BDs must 
be located in “readable” and “writable” memory and 
must not specify DMA writes to BDs: The DMAC will 
violate memory isolation when fetching, updating and 
writing back BD1. Also, the DMAC can modify BD3 
while processing BD1, since BD3 overlaps the buffer 
addressed by BD1. 

Fig. 2. DMAC isolation violations. Readable and writable region is colored 
in gray. 

Basically, these conditions guarantee that the BDs “instruct” 
the DMAC to access only “readable” and “writable” memory, 
and that the DMAC cannot change such BD “instructions”. 
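
The conditions above can be illustrated with a small executable check. The following Python sketch is only an illustration under stated assumptions: the `BD` record, its field names, the helper `violations`, and the concrete addresses are hypothetical and are not part of the HOL4 framework. Memory regions are modeled as sets of integer addresses, and the example loosely mirrors the situation of Fig. 2.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BD:
    """Hypothetical buffer descriptor; every field is a set of byte addresses."""
    name: str
    ra: frozenset    # where the BD itself is stored (empty for internal BDs)
    dra: frozenset   # buffer read via DMA
    dwa: frozenset   # buffer written via DMA

def violations(bds, readable, writable):
    """Return (BD name, reason) pairs for every violated condition."""
    found = []
    for bd in bds:
        # Condition 1: DMA buffers lie inside the allowed regions.
        if not bd.dra <= readable:
            found.append((bd.name, "DMA read outside readable memory"))
        if not bd.dwa <= writable:
            found.append((bd.name, "DMA write outside writable memory"))
        # Condition 2: external BDs are stored in readable and writable memory
        if not bd.ra <= (readable & writable):
            found.append((bd.name, "BD stored outside allowed memory"))
        # and no BD specifies a DMA write that overlaps a BD.
        for other in bds:
            if bd.dwa & other.ra:
                found.append((bd.name, f"DMA write overlaps {other.name}"))
    return found

# Assumed example layout: addresses 100..199 are readable and writable.
allowed = frozenset(range(100, 200))
bd1 = BD("BD1", ra=frozenset(range(96, 104)), dra=frozenset(),
         dwa=frozenset(range(150, 210)))
bd3 = BD("BD3", ra=frozenset(range(160, 168)), dra=frozenset(),
         dwa=frozenset(range(120, 130)))
report = violations([bd1, bd3], allowed, allowed)
```

In this assumed layout, `report` lists three problems for BD1 (its buffer leaves the writable region, the BD itself is partly stored outside the allowed region, and its buffer overlaps BD3) and none for BD3.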


III. GENERAL DMAC MODEL 


We assume a computer system to be the composition $c|m|d$, 
where the components represent the states of a CPU, a memory, 
and a DMAC, respectively. We use standard synchronous 
composition of the transition systems of the components 
(assuming that parallel composition is associative, symmetric, 
and commutative): 

$$\frac{x \xrightarrow{l} x' \qquad y \xrightarrow{\overline{l}} y'}{x|y \xrightarrow{\tau} x'|y'} \qquad \frac{x \xrightarrow{l} x'}{x|y \xrightarrow{l} x'|y}$$

The labels of these transition systems are $\tau$ for internal 
operations, and $rd(as, bs)$/$wt(as, bs)$ for reading/writing the 
bytes $bs$ at/to the locations with addresses $as$, where the latter 
two have co-labels $\overline{rd}(as, bs)$ and $\overline{wt}(as, bs)$. 

We do not explicitly define the CPU model. This model 
could for instance be the formalization of an Instruction Set 
Architecture (ISA) or a more abstract model of a device 
driver. Memory is an array of bytes, where $M$ represents the 
addresses of the main memory: 

$$\frac{as \subseteq M}{m \xrightarrow{\overline{rd}(as,\, m[as])} m} \qquad \frac{as \subseteq M}{m \xrightarrow{\overline{wt}(as,\, bs)} m[as \mapsto bs]}$$

Notice that we use early semantics: the memory is always 
ready to receive a memory update, non-deterministically 
selecting all possible bytes $bs$. This non-determinism is resolved 
when the memory transition system is composed with 
another transition system that performs a write. 


A. DMAC Transition System 


The DMAC state consists of three components, $d = (s, b, c)$: 
an internal state $s$, whose type depends on the specific DMAC; 
a message box $b$ containing memory requests and replies; and 
a DMA channel $c$ (the model supports multiple channels, but 
we omit them for simplicity). We will use Fig. 3 to illustrate 
the model, where a queue of five BDs has been configured in 
main memory and each BD points to a buffer. 

Fig. 3. DMAC model. 

The message box $b$ allows the DMAC to operate asynchronously 
w.r.t. the memory. This box is a set of memory 
read and write requests and replies: $r^{op}_t[as]$, $w^{op}_t[as, bs]$ and 
$p^{op}_t[bs]$, where $op$, $t$, $as$ and $bs$ denote: the DMAC pipeline 
or maintenance operation in $O \cup \{m\}$ that issued the memory 
request or that shall have the reply; a memory request-reply 
identifier tag; addresses to read/write; and bytes read/written. 

The component $c : O \setminus \{f\} \to B$ models the DMAC 
pipeline. In the following we use $c.op$ to denote $c(op)$. Hence, 
$c.u = [bd_1, \ldots, bd_n]$ denotes the queue of BDs in the update 
stage, with $n$ arbitrary and $n = 0$ denoting an empty queue; 
and similarly for $p$ and $w$. We call these abstract BDs, since 
they are records whose type depends on the specific DMAC 
and contain the same information that is stored by the BDs 
in main memory. Independently of the DMAC instantiation, 
a BD $bd$ always contains four mandatory fields specifying 
the addresses of the locations: where it is stored, $bd.ra$; that 
are updated when it is written back, $bd.wa$ (e.g., the address 
of its completion flag); and of the buffers that must be read 
and written via DMA, $bd.dra$ and $bd.dwa$. The BDs in $c$ are 
the ones that have been fetched, with each BD being in some 
DMAC pipeline stage. For instance, in Fig. 3 three BDs have 
been fetched and are therefore in the DMAC pipeline (bd2 and 
bd3 are being processed and bd1 is currently being written back). We 
use “pending” BDs to refer to the BDs in the queue that are 
left to fetch (e.g., BD4 and BD5). Normally, the concatenation 
of the queues in $c$ represents a sliding window of the queue 
in memory. 
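
The structure of the pipeline component can be made concrete with a small data-structure sketch. The Python below is an illustration only: the class names `AbstractBD` and `Channel`, the helper `bd`, the address layout, and the choice to list the oldest stage first in `fetched` are assumptions made for this example, not definitions taken from the HOL4 model. It reproduces the situation of Fig. 3, where three of five BDs have been fetched.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class AbstractBD:
    """The four mandatory address fields of an abstract BD (sets of addresses)."""
    ra: frozenset    # where the BD is stored
    wa: frozenset    # locations updated on write back (e.g., completion flag)
    dra: frozenset   # buffer read via DMA
    dwa: frozenset   # buffer written via DMA

@dataclass
class Channel:
    """Pipeline component c: one queue per stage u, p, w (fetch has no queue)."""
    u: list = field(default_factory=list)
    p: list = field(default_factory=list)
    w: list = field(default_factory=list)

    def fetched(self):
        # Assumed ordering: oldest BDs first (write back, process, update).
        return self.w + self.p + self.u

def bd(i):
    """Hypothetical 16-byte BD number i with a 64-byte DMA write buffer."""
    base = 0x1000 + 16 * i
    buf = 0x8000 + 64 * i
    return AbstractBD(ra=frozenset(range(base, base + 16)),
                      wa=frozenset({base}),
                      dra=frozenset(),
                      dwa=frozenset(range(buf, buf + 64)))

queue_in_memory = [bd(i) for i in range(1, 6)]      # BD1 .. BD5
c = Channel(p=queue_in_memory[1:3], w=queue_in_memory[0:1])
pending = queue_in_memory[3:]                        # BD4 and BD5
# Sliding window: fetched BDs followed by pending BDs give the memory queue.
window_ok = c.fetched() + pending == queue_in_memory
```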

To account for the DMAC specifics, the rules describing 
DMAC transitions are defined in terms of two records. The 
record A contains behavioral functions that model the 
specific actions of a DMAC. The record Π contains projection 
functions that extract information from the state and return 
the proper data structures (e.g., BDs). These DMAC specific 
functions must be defined to obtain a concrete DMAC model. 
Table II summarizes the behavioral functions (except for a 
scheduler that resolves non-determinism) and the two most 
important projection functions. 

TABLE II 
SUMMARY OF THE DMAC SPECIFIC FUNCTIONS. 

Function | Modeled Operation/State Information 

A.rr | (See rule [rr]) Given an internal state s1 and the addresses 
as of the DMAC register to read, returns an updated internal 
state s2, the read bytes bs, and potential maintenance memory 
requests rs associated with the read. 

A.wr | (See rule [wr]) Given an internal state s1, the addresses as of 
the DMAC register to write, and the bytes bs to write, returns an 
updated internal state s2 and potential maintenance memory 
requests rs associated with the write. 

Π.fas | (See rule [f1]) Given an internal state s, returns the memory 
read request $r^f_t[as]$ for fetching the next part of the BD being 
fetched at addresses as and with request identification tag t. 

A.f | (See rules [f2] and [f3]) Given an internal state s1, and 
for external BDs a fetch reply $p^f_t[bs]$ (where the bytes bs 
constitute a part of the currently fetched BD and with t 
being the request identification tag of the corresponding read 
request), but ⊥ for internal BDs; returns an updated internal 
state s2, and either a fetched BD bd and the stage op ∈ {u, p} 
the BD shall be moved to, or ⊥ if additional external or 
internal memory reads are necessary to fetch the next BD. 

A.p | (See rule [pt]) Given an internal state s1, the first BD bd in 
the process stage and whose memory transfers are currently 
being performed (i.e., the DMA transfers specified by bd), 
and the DMA read replies ps associated with the process 
stage; returns an updated internal state s2 reflecting the 
processing of the given memory replies and the generation 
of potentially new memory requests rs, and a boolean flag 
indicating whether all requests/replies associated with bd have 
now been issued/processed and the BD shall be moved to the 
write back queue. 

A.w | (See rule [w]) Given an internal state s1, and the BDs in the 
write back queue c.w; returns an updated internal state s2, 
the memory write requests rs containing the bytes to write to 
memory associated with any given BD (not used for internal 
BDs), and the BDs bds that are now released due to the write 
back (removed from the write back queue). 

A.m | (See rule [m]) Given an internal state s1 and memory read 
replies ps (to read requests issued by [rr] and [wr]); returns 
an updated internal state s2 and the processed replies pps 
that shall be removed from the message box. 

Π.cf | (See rules [w] and [ma] in Subsection IV-A) Given internal 
state s and memory m, returns the pending BDs bds that 
remain to be fetched (bds = [BD4, BD5] in Fig. 3). 


In the following we use $D$ to represent the set of addresses 
of DMAC registers. The reaction of the DMAC when the CPU 
accesses such a register at addresses $as$ is DMAC specific and 
must be described by the Read Register and Write Register 
functions: A.rr and A.wr. Notice that these functions can 
affect the internal state of the DMAC and may return memory 
requests $rs$ in case a register access makes it necessary for the 
DMAC to update maintenance data in main memory ($c = a + b$ 
denotes $c = a \cup \{b\} \land a \cap \{b\} = \emptyset$): 

$$\frac{(s_2, bs, rs) = A.rr(s_1, as) \qquad as \subseteq D}{(s_1, b, c) \xrightarrow{\overline{rd}(as,\, bs)} (s_2, b + rs, c)}\ [rr]$$

$$\frac{(s_2, rs) = A.wr(s_1, as, bs) \qquad as \subseteq D}{(s_1, b, c) \xrightarrow{\overline{wt}(as,\, bs)} (s_2, b + rs, c)}\ [wr]$$


The message box acts as a buffer between the memory 
and the DMAC. The message box synchronizes with memory, 
consuming a request (previously produced by operation $op$ and 
with identifier $t$) and, for reads, adding a corresponding reply: 

$$(s, b + r^{op}_t[as], c) \xrightarrow{rd(as,\, bs)} (s, b + p^{op}_t[bs], c)\ [rm]$$

$$(s, b + w^{op}_t[as, bs], c) \xrightarrow{wt(as,\, bs)} (s, b, c)\ [wm]$$

The other rules are for internal DMAC transitions. For 
fetching BDs ($op = f$) there are five cases: three if BDs are 
stored in main memory and two if BDs are stored in internal 
memory. [f1] describes the first step in fetching an external 
BD, which is applicable when there are no pending memory 
replies for BD fetches. In this case a memory request is added 
to the message box for fetching new BDs. In Fig. 3, the rule 
can produce the request RQf when starting to fetch BD4. 
The addresses and the tag are given by the function Fetch 
Addresses Π.fas: 

$$\frac{\{p^f_t[bs] \in b\} = \emptyset \qquad r^f_t[as] = \Pi.fas(s)}{(s, b, c) \xrightarrow{\tau} (s, b + r^f_t[as], c)}\ [f1]$$


When a memory read request for fetching a BD is served, 
the corresponding reply is added to the message box. [f2] 
describes the behavior when such a reply exists but more reads 
are necessary to fetch the complete BD, in which case the 
function Fetch A.f returns ⊥. A.f can update the internal 
state with the consumed reply, which contains a partial BD: 

$$\frac{(s_2, \bot) = A.f(s_1, p^f_t[bs])}{(s_1, b + p^f_t[bs], c) \xrightarrow{\tau} (s_2, b, c)}\ [f2]$$


[f3] handles the case when a BD fetch reply $p^f_t[bs]$ exists 
and it contains the last chunk of bytes $bs$ of the BD $bd$ 
being fetched. In this case A.f returns a pair consisting of 
the abstract representation of the fetched BD $bd$ and which 
pipeline stage queue $op \in \{u, p\}$ the BD shall be appended to 
(denoted by ++): 

$$\frac{(s_2, (bd, op)) = A.f(s_1, p^f_t[bs])}{(s_1, b + p^f_t[bs], c) \xrightarrow{\tau} (s_2, b, c[op \mapsto c.op \mathbin{+\!\!+} bd])}\ [f3]$$

The rules for fetching BDs in DMACs with internal BDs are 
similar to [f2] and [f3], but no memory requests and replies 
are involved, since BDs are obtained from the internal DMAC 
state. 
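
To illustrate how the fetch rules for external BDs interact with the message box, the following Python sketch replays them on a toy device. Everything specific here is an assumption made for the example: messages are encoded as tuples, `fas` and `fetch` are hypothetical stand-ins for Π.fas and A.f, and the toy DMAC reads each 8-byte BD in two 4-byte chunks. It is not the HOL4 formalization.

```python
def fetch_step(s, box, c, fas, fetch):
    """Apply one of [f1], [f2], [f3]; return None if no fetch rule applies.

    s: internal state, box: frozenset of messages, c: dict stage -> BD list.
    fas(s) -> (tag, addrs); fetch(s, reply) -> (s2, None) or (s2, (bd, stage)).
    """
    replies = [m for m in box if m[0] == "reply_f"]
    if not replies:
        # [f1]: no pending fetch reply, so issue a read request.
        req = ("req_f",) + fas(s)
        if req in box:
            return None  # request already issued and not yet served
        return s, box | {req}, c
    reply = replies[0]
    s2, result = fetch(s, reply)
    box2 = box - {reply}
    if result is None:
        # [f2]: a partial BD was consumed, more reads are needed.
        return s2, box2, c
    # [f3]: the BD is complete and is appended to the given stage queue.
    bd, stage = result
    c2 = dict(c)
    c2[stage] = c[stage] + [bd]
    return s2, box2, c2

def serve_reads(box, mem):
    """[rm]: memory replaces each fetch request by the corresponding reply."""
    out = set()
    for m in box:
        if m[0] == "req_f":
            out.add(("reply_f", m[1], bytes(mem[a] for a in m[2])))
        else:
            out.add(m)
    return frozenset(out)

CHUNK, BD_SIZE = 4, 8  # assumed sizes for the toy device

def fas(s):
    addr, partial = s
    start = addr + len(partial)
    return 0, tuple(range(start, start + CHUNK))

def fetch(s, reply):
    addr, partial = s
    data = partial + reply[2]
    if len(data) < BD_SIZE:
        return (addr, data), None
    return (addr + BD_SIZE, b""), (data, "p")

mem = bytes(range(32))
s, box, c = (0, b""), frozenset(), {"u": [], "p": [], "w": []}
for _ in range(4):  # [f1], [f2], [f1], [f3], each followed by [rm]
    s, box, c = fetch_step(s, box, c, fas, fetch)
    box = serve_reads(box, mem)
```

After the four steps the first BD has been moved into the process queue and the internal state points at the next BD, which corresponds to one full fetch of a BD that needs two memory reads.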

Two rules model the process stage ($op = p$), depending 
on whether the currently processed BD is now completed 
or not. The following rule covers the case when a BD 
is completely processed (the other case, when more DMA 
transfers remain for the BD, is similar, but keeps the BD at the 
head of the process queue). In either case, the function Process 
A.p models the DMAC specific behavior of generating and 
processing memory requests and replies. It takes the currently 
processed BD $bd$ at the head of $c.p$, and pending memory 
replies for the process stage; and returns an updated internal 
state, optional new memory requests $rs$, and a completion flag 
which specifies if the BD has now been processed and shall be 
moved to the write back stage. These requests represent DMA 
reads and writes, while the replies are the results of previously 
issued read requests that have been served by memory. All 
replies are consumed and the new requests are added to the 
message box. In Fig. 3 the rule can produce the request RQp 
to write the buffer addressed by bd2. 

$$\frac{c.p = bd :: bds \qquad ps = \{p^p_t[bs] \in b\} \qquad (s_2, rs, true) = A.p(s_1, bd, ps)}{(s_1, b, c) \xrightarrow{\tau} (s_2, b - ps + rs, c[p \mapsto bds, w \mapsto c.w \mathbin{+\!\!+} bd])}\ [pt]$$


Updating and writing back BDs are similar and for this 
reason we only describe write back in detail. The main 
difference is that updating a BD moves the updated BD from 
the head of the update queue to the tail of the process queue, while 
a write back may remove a (possibly empty) prefix of BDs 
from the write back queue $c.w$. If BDs are stored in main 
memory, the Write back function A.w returns the memory 
write requests $rs$ for writing back the BDs, while internal 
BDs are written back by updating the internal state (in Fig. 3 
the rule can produce the request RQw to update bd1 in main 
memory): 

$$\frac{(s_2, rs, bds) = A.w(s_1, c.w)}{(s_1, b, c) \xrightarrow{\tau} (s_2, b + rs, c[w \mapsto c.w - bds])}\ [w]$$

Finally, the DMAC can react to the replies $ps$ to the 
read requests produced by the maintenance operations (i.e., 
requests issued by [rr] and [wr]), removing the processed 
replies $pps \subseteq ps$ from the message box: 

$$\frac{ps = \{p^m_t[bs] \in b\} \qquad (s_2, pps) = A.m(s_1, ps)}{(s_1, b, c) \xrightarrow{\tau} (s_2, b - pps, c)}\ [m]$$


IV. VERIFICATION 


Our goal is to verify general conditions that are sufficient 
to guarantee DMAC isolation (Theorem 1): The DMAC can 
only read “readable” and write “writable” memory regions, 
denoted by the sets of addresses R and W. 

Our verification is based on refinement. Let $M_3$ be the 
DMAC model defined in Section III. We introduce two layered 
abstractions $M_2$ and $M_1$. For each model $M_{i+1}$ we introduce 
an invariant $\mathcal{I}_{i+1}$ that allows us to prove bisimulation between 
$M_{i+1}$ and $M_i$. We finally introduce an invariant $\mathcal{I}_1$ for $M_1$ 
that demonstrates DMAC isolation and use the bisimulation 
to transfer this property down to the $M_3$ DMAC model. 
This strategy has three benefits: (i) it allows us to solve 
one problem at a time via a single refinement step; (ii) it 
establishes a bisimulation between the concrete model and 
the more abstract one, which allows further properties (e.g., 
functional correctness of a device driver) to be verified using 
abstract models; (iii) it allows us to identify assumptions that 
all DMAC instantiations and drivers must satisfy in the form 
of proof obligations. The obligations must be proved for a 
given DMAC instantiation, but these proofs depend only on 
the instantiation (A and Π) rather than on a complete DMAC 
model. The driver conditions can be proven relying only on 
the DMAC guarantees that are established by our verification. 


A. Abstract DMAC Models 


The lower abstraction $M_2$ is a virtual DMAC that cannot 
self-modify pending BDs. This property allows a driver to 
prepare, extend, and read the queue that must be fetched by 
the DMAC without being concerned that the DMAC may alter 
the queue. This is done by checking that pending BDs are not 
addressed by BD updates, write backs, and DMA writes. For 
instance, the rule for write back becomes (where $a \mathrel{\#} b$ means 
that sets $a$ and $b$ are disjoint: $a \cap b = \emptyset$): 

$$\frac{(s_2, rs, bds) = A.w(s_1, c.w) \qquad \Big(\bigcup_{bd \in bds} bd.wa \;\cup \bigcup_{w^w_t[as, bs] \in rs} as\Big) \mathrel{\#} \bigcup_{bd \in \Pi.cf(m, s_1)} bd.ra}{(s_1, b, c) \xrightarrow{\tau} (s_2, b + rs, c[w \mapsto c.w - bds])}\ [w]$$

The rule prevents write backs from modifying pending BDs, 
independently of whether the BDs are stored in internal or 
main memory. For internal BDs, the locations modified by 
A.w are identified from the list of released BDs $bds$. For 
external BDs, the addresses are in the requests $rs$ produced 
by A.w. Π.cf returns the list of remaining (Concrete) pending 
BDs to Fetch, as identified by the internal state and memory 
(BD4 and BD5 in Fig. 3). 

The upper abstraction $M_1$ guarantees that BDs cannot be 
changed by the CPU. The pending BDs to fetch are stored in 
an abstract queue $c.f$. By definition the CPU cannot modify 
or remove entries from this list, but it can append BDs by 
either writing a DMAC register (e.g., the tail pointer 
register) or writing memory (e.g., the next pointer field of a 
BD in external memory). This makes it possible to prove 
properties of DMA transfers (e.g., memory isolation) without 
considering interleavings with CPU transitions which can 
potentially corrupt pending BDs. This abstract model alters 
the previous transition system by composing the abstract DMAC 
and memory in such a way that the abstract DMAC can 
“magically” extend the abstract queue of pending BDs with new 
BDs $bds$ when the CPU writes memory $m_1$ at locations with 
addresses $as$ and bytes $bs$, resulting in memory $m_2$ (writing 
registers is similar but with the updated internal state 
considered instead of updated memory): 

$$\frac{as \subseteq M \qquad m_2 = m_1[as \mapsto bs] \qquad bds' = \Pi.cf(m_2, s) \qquad \exists bds.\ bds' = c.f \mathbin{+\!\!+} bds}{m_1|(s, b, c) \xrightarrow{\overline{wt}(as,\, bs)} m_2|(s, b, c[f \mapsto bds'])}\ [ma]$$

The internal operations of $M_1$ also differ. For [f3], the BD 
$bd$ returned by A.f is ignored and instead the first BD of $c.f$ 
is moved to $c.u$ or $c.p$, depending on whether the BD shall be 
updated or not. The reason why main memory is still accessed 
to fetch BDs (even though they are not used) is to keep the 
transition systems synchronized: Internal states are updated 
identically in both $M_1$ and $M_2$. In addition, the checks for 
updates/write backs and DMA writes in $M_2$ are also in $M_1$. 
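
The disjointness check that the abstract write back rule [w] adds can be expressed as a short predicate. The Python sketch below is an illustration under assumptions: BDs are plain dictionaries with address sets under the keys `"wa"` and `"ra"`, write requests are `(addresses, bytes)` pairs, and the function name `write_back_allowed` is hypothetical.

```python
def write_back_allowed(released_bds, write_requests, pending_bds):
    """Return True if the addresses touched by a write back are disjoint
    from the addresses where pending BDs are stored.

    released_bds:   BDs released by the write back (dicts with a set "wa")
    write_requests: (addresses, bytes) pairs produced by the write back
    pending_bds:    pending BDs from the projection (dicts with a set "ra")
    """
    written = set()
    for bd in released_bds:
        written |= bd["wa"]          # covers internal BDs
    for addrs, _ in write_requests:
        written |= set(addrs)        # covers external BDs
    pending = set()
    for bd in pending_bds:
        pending |= bd["ra"]
    return written.isdisjoint(pending)

# Assumed layout: bd1 is stored at 0x100..0x10f, BD4 at 0x130..0x13f.
bd1 = {"ra": set(range(0x100, 0x110)), "wa": {0x10C}}
bd4 = {"ra": set(range(0x130, 0x140)), "wa": {0x13C}}
ok = write_back_allowed([bd1], [((0x10C,), b"\x01")], [bd4])
bad = write_back_allowed([bd1], [((0x134,), b"\x01")], [bd4])
```

Here `ok` is true because only the completion flag of bd1 is written, while `bad` is false because the second write request targets an address inside the pending BD4.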


B. Refinement Relations, Invariants, and Proof Obligations 


We use $(m, d_{i+1}) \sim_{i+1} (m, d_i)$ for the refinement relation 
between $M_{i+1}$ and $M_i$. These relations require the common 
state fields to be equal: $d_{i+1} = d_i$. Additionally, $(m, d_2) \sim_2 
(m, d_1)$ requires that the abstract and concrete pending BDs 
are equal: $d_1.c.f = \Pi.cf(m, d_2.s)$. 

The refinement proofs depend on invariants that restrict the 
state of the lower layer. The invariant for $M_2$ requires that no 
DMA write request targets pending BDs (with $\#$ denoting 
disjointness of sets): 

$$\mathcal{I}_2(m, s, b, c) = \big(w^p_t[as, bs] \in b \land bd \in \Pi.cf(m, s) \Rightarrow as \mathrel{\#} bd.ra\big)$$

This invariant simply propagates the checks of the internal 
abstract DMAC operations (e.g., [w] of $M_2$). 

In order to establish the bisimulation for the model of 
Section III, we also need an invariant that enforces the same 
constraints that are checked by the abstract models. The 
invariant $\mathcal{I}_3$ requires that no pending or fetched BD in the 
pipeline has update/write back addresses or DMA 
writes to pending BDs (this includes that pending BDs do not 
overlap; in the definition of $\mathcal{I}_3$, $c$ denotes the concatenation of 
$c.u$, $c.p$ and $c.w$, and $\#$ denotes disjointness of sets): 

$$\mathcal{I}_3(m, s, b, c) = \bigcup_{bd \in c \,\cup\, \Pi.cf(m, s)} (bd.wa \cup bd.dwa) \mathrel{\#} \bigcup_{bd \in \Pi.cf(m, s)} bd.ra$$

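The last invariant restricts $M_1$ to force the DMAC to 
access only readable and writable memory (in the definition 
of $\mathcal{I}_1$, $c$ denotes the concatenation of $c.f$, $c.u$, $c.p$ and $c.w$): 

$$\mathcal{I}_1(m, s, b, c) = \Big(\bigcup_{r^{op}_t[as] \in b} as \;\cup \bigcup_{bd \in c} bd.ra \;\cup \bigcup_{bd \in c.op,\, op \neq w} bd.dra\Big) \subseteq R \;\land\; \Big(\bigcup_{w^{op}_t[as, bs] \in b} as \;\cup \bigcup_{bd \in c} bd.wa \;\cup \bigcup_{bd \in c.op,\, op \neq w} bd.dwa\Big) \subseteq W$$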

The instantiation of a given DMAC must satisfy some 
proof obligations, which mainly state that the behavioral and 
projection functions are consistent: 

1) A fetched BD (by [f3]) is the first pending BD: 
If $r^f_t[as] = \Pi.fas(s_1)$ and $(s_2, (bd, op)) = 
A.f(s_1, p^f_t[m[as]])$, then there exist BDs $bds$ such that 
$\Pi.cf(m, s_1) = bd :: bds$. Also, after fetching a BD, the 
projection function must reflect the removal of the BD 
from the pending queue: $\Pi.cf(m, s_2) = bds$. 

2) The queue of pending BDs depends only on the 
locations of the BDs and the internal state: If 
$\forall a \in \bigcup_{bd \in \Pi.cf(m_1, s)} bd.ra.\ m_2[a] = m_1[a]$, then 
$\Pi.cf(m_1, s) = \Pi.cf(m_2, s)$. 

3) The function associated with DMA transfers does 
not affect the queue of pending BDs: $(s_2, rs, cf) = 
A.p(s_1, bd, ps)$ implies $\Pi.cf(m, s_2) = \Pi.cf(m, s_1)$. 

The proof obligations of the driver are that it only appends 
BDs and preserves the invariants $\mathcal{I}_1$ and $\mathcal{I}_3$. This proof 
obligation is only relevant for non-internal CPU transitions, since 
the invariants do not depend on the CPU. For memory writes 
(other cases are similar) this means that if $\bigwedge_{i \in \{1, 3\}} \mathcal{I}_i(m, d)$, 
$cpu \xrightarrow{wt(as,\, bs)} cpu'$, and $as \subseteq M$ then: 

1) $\exists bds.\ \Pi.cf(m[as \mapsto bs], d.s) = \Pi.cf(m, d.s) \mathbin{+\!\!+} bds$. 

2) $(m|d) \xrightarrow{\overline{wt}(as,\, bs)}_1 (m'|d')$ implies $\mathcal{I}_i(m', d')$, where $\rightarrow_1$ 
denotes the transition relation of $M_1$. 

That is, writes (updates, write backs and DMA writes) of 
appended BDs do not point to pending BDs or non-writable 
memory, appended BDs do not overlap, and reads (both fetches 
and DMA) of appended BDs do not point to non-readable 
memory. Notice that invariant preservation can be checked 
on the state of the more abstract DMAC model $M_1$, 
disregarding the lower layers. 


C. Refinement and Memory Isolation 


Refinement is phrased as a bisimulation and assumes the 
invariant. For $i \in \{1, 2\}$ ($\rightarrow_i$ denotes the transition relation of 
$M_i$): 

Lemma 1. If $\mathcal{I}_{i+1}(m, d)$, $(m, d) \sim_{i+1} (m, e)$, and 
$(c, m, d) \rightarrow_{i+1} (c', m', d')$ then there exists $e'$ such that 
$(c, m, e) \rightarrow_i (c', m', e')$ and $(m', d') \sim_{i+1} (m', e')$, and vice 
versa with transitions of $\rightarrow_i$. 

Proof. Consider $i = 1$. For the fetch rules the main difference 
between $M_1$ and $M_2$ is that $M_1$ fetches abstract BDs and 
$M_2$ fetches concrete BDs. $\sim_2$ guarantees that these queues 
are equal. $M_1$ moves the first BD of $d_1.c.f$ to the tail of 
the update or process queue ($d_1.c.op$, $op \in \{u, p\}$). DMAC 
proof obligation 1) ensures that $M_2$ performs a corresponding 
operation by moving the first concrete BD of $\Pi.cf(m, d_2.s)$. 

For updating, processing and writing back BDs, the abstract 
pending BDs of $M_1$ cannot change by definition. To show that 
the concrete pending BDs of $M_2$ are also unchanged we use 
the update/write back checks in $M_2$ and DMAC proof 
obligation 2). Moreover, $\mathcal{I}_2$ and DMAC proof obligation 2) 
imply that memory writes do not change concrete pending 
BDs in $M_2$, preserving equality between the concrete and abstract 
BD queues. 

Finally, for CPU transitions, there are two cases depending 
on whether the pending BDs are modified. If not, then memory 
and register accesses have identical effects in $M_1$ and $M_2$. 
Otherwise, Driver proof obligation 1) ensures that $M_2$ only 
appends BDs. This allows $M_1$ to produce the corresponding 
abstract queue of pending BDs by extending the existing one 
via the rule [ma] (and similarly for register writes). 

For $i = 2$, $\mathcal{I}_3$ is transferred by $\sim_3$ to $M_2$, implying that 
all checks in $M_2$ pass (e.g., [w]). Thus, $M_2$ and $M_3$ perform 
identical operations. CPU and DMAC memory transitions are 
identical in $M_2$ and $M_3$. 


We then prove that invariants are preserved and transferred 
by the refinements: 

Lemma 2. If $\mathcal{I}_i(m, d)$ and $(c, m, d) \rightarrow_i (c', m', d')$ then 
$\mathcal{I}_i(m', d')$. Also if $j < i$ and $(m, d_j) \sim_i (m, d_{j+1})$ then 
$\mathcal{I}_i(m, d_{j+1}) \Leftrightarrow \mathcal{I}_i(m, d_j)$. 

Fig. 4. Organization of BD queues of the USB DMAC. 

Finally, we show that DMAC transitions modify and depend 
only on the right regions of memory (where $f|_A$ is the 
projection of a function $f$ over domain $A$ and $\overline{A}$ is set complement): 

Theorem 1. If $\bigwedge_i \mathcal{I}_i(m, d)$ and $(c, m, d) \rightarrow_3 (c, m', d')$ then 
$m|_{\overline{W}} = m'|_{\overline{W}}$, and if $m|_R = m_1|_R$ then $(c_1, m_1, d) \rightarrow_3 
(c_1, m_1', d')$ and $m'|_W = m_1'|_W$. 


The theorem follows from Lemmas 1 and 2 and by establishing 
a further bisimulation with an even more abstract layer that 
is isolated by construction. This model has additional checks 
compared to $M_1$ that prevent adding memory requests to the 
message box that point outside $R$ and $W$. 
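
The two claims of the isolation theorem can be read as a testable property of a single transition. The Python sketch below is only an illustration: it assumes a deterministic, hypothetical step function `step(m, d)` in place of the transition relation, models memory as a dictionary from addresses to bytes, and uses a toy memory-to-memory copy device `copy_step` that is not taken from the paper.

```python
def restrict(mem, addrs):
    """Projection of a memory (dict address -> byte) onto a set of addresses."""
    return {a: mem[a] for a in addrs}

def isolated_step(step, d, m, m1, R, W):
    """Check both isolation claims for one step of a deterministic `step`."""
    outside_w = set(m) - W
    m_next, d_next = step(m, d)
    # Claim 1: memory outside the writable region W is unchanged.
    unchanged = restrict(m, outside_w) == restrict(m_next, outside_w)
    if restrict(m, R) != restrict(m1, R):
        return unchanged  # claim 2 has no premise to discharge
    # Claim 2: from a memory that agrees on R, the step yields the same
    # DMAC state and the same contents of W.
    m1_next, d1_next = step(m1, d)
    same = d1_next == d_next and restrict(m_next, W) == restrict(m1_next, W)
    return unchanged and same

def copy_step(m, d):
    """Toy DMAC transition: copy one byte from address src to address dst."""
    src, dst = d
    m2 = dict(m)
    m2[dst] = m[src]
    return m2, d

R, W = {0, 1}, {2, 3}
m = {a: a for a in range(8)}
m1 = dict(m)
m1[5] = 99                       # differs from m only outside R
good = isolated_step(copy_step, (0, 2), m, m1, R, W)
reads_outside_r = isolated_step(copy_step, (5, 2), m, m1, R, W)
writes_outside_w = isolated_step(copy_step, (0, 6), m, m1, R, W)
```

Only the first configuration satisfies the property: reading address 5 makes the result depend on memory outside R, and writing address 6 modifies memory outside W.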


V. USB DMAC 


We instantiate our framework with the DMAC of the USB 
controller of the AM3358 SoC by Texas Instruments [32], 
the SoC on the development board BeagleBone Black [7]. As 
Fig. 4 illustrates, the DMAC organizes BD queues by means 
of two memory regions, one storing BDs (BDRAM) and one 
storing linking information (LRAM), the base addresses of 
which are configurable. Both regions are organized as arrays 
with the same number of entries. To transmit a DMA packet, 
potentially scattered in memory in multiple buffers (e.g., DMA 
packet 1 is the concatenation of buf11, buf12 and buf13), the 
driver initializes in BDRAM one BD for each buffer (BD11, 
BD12 and BD13), linking them via the next descriptor pointer 
in the order the data buffers shall be transmitted to the USB 
device. The first BD of a packet is called Start Of Packet 
(SOP). The LRAM is used to link packets: if BDRAM[i] is 
a SOP then LRAM[i] links the SOP BD of the next DMA 
packet (BD11 is linked to BD21 via LRAM entry LE11, in 
effect linking DMA packets 1 and 2). Both the driver and the 
DMAC read and write BDRAM, but only the DMAC uses 
LRAM. 

To enqueue a DMA packet the driver writes the address of 
its SOP BD (e.g., BD21 to enqueue packet 2) to the enqueue 
register Q. This write causes the DMAC to append the BDs of 
the new DMA packet to the pending queue: The LRAM entry 
of the previous tail SOP BD (e.g., LE11) is updated with a link 
to the appended SOP BD. Once a BD has been fetched, it is 
processed, without being updated, and finally written back. A 
write back moves the head SOP BD of the transferred DMA 
packet from the pending queue to the tail of the completion 
queue (which is another queue whose links are also stored in 
LRAM). The completion queue is traversed by the driver to 
recycle BDs. The driver does this by reading the C register, 
making the DMAC return the address of the first SOP BD 
in the completion queue, and read LRAM to find the next 
completed SOP BD which now becomes the first SOP BD in 
the completion queue. 

Fig. 5. State diagram of the USB DMAC instantiation. The transition labels 
denote the rules that cause the corresponding transition. 

We focus on the instantiation of the transmission channel, 
since reception is similar. The internal state is a record s = 
(r, hp, tp, hc, tc, t) containing the registers r (except Q and C 
which are not physical registers), the head and tail pointers 
of the pending and completion queues hp, tp, hc, tc, and the 
state t of the automaton in Fig. 5 that keeps track of the state 
of the operation of the current DMA packet in transfer. 

In state f, the rules [f1] and [f3] fetch the next BD and 
move it to the process queue c.p (BDs are fetched atomically, 
making [f2] unnecessary; thus, A.f always returns a BD). 

In state p1, [pf] repeatedly produces memory read requests 
and handles replies until all data in the buffer has been read. If 
the BD in c.p is not the last BD of the DMA packet (e.g. BD12) 
then [pt] sets the next state to f to operate on the next BD 
of the DMA packet (e.g. BD13). Otherwise, after processing 
the last byte of the buffer, a further application of [pf] is used 
to produce a DMA read request needed to read the LRAM 
entry of the SOP BD (LE11) of the DMA packet in transfer. 
This data is needed later to update the linking RAM in the write 
back stage and must be read by [p], since [w] cannot read 
memory. The state is set to p2, in which [pt] processes the 
reply containing the LRAM entry and sets the state to w1. 

Write backs are performed in two steps. First, in state w1, 
[w] updates the head pointer hp of the pending queue to the 
address of the next SOP BD (BD21), which has been previously 
retrieved in p2. Second, in state w2, the tail pointer tc of the 
completion queue is set to the address of the completed SOP 
BD (BD11); the LRAM entry of the previous tail SOP BD of 
the completion queue is now linked to the new tail (completed) 
SOP BD (e.g. BD11); the next state is f to fetch the next 
SOP BD (BD21); and all BDs accumulated in c.w are released, 
meaning that the driver can reuse them. 

Register accesses are performed by directly reading and 
writing s.r, except Q and C. When Q is written, tp is updated to 
the written address of the appended SOP BD. When C is read, 
the value of hc is returned with hc set to the address of the 
next SOP BD in the completion queue. These register accesses 
cause additional DMA management accesses to LRAM in 
order to reflect the queue updates (e.g., linking LE;; to LE, 
when BD» is written to Q). 

TABLE III 
EFFORT OF VERIFYING MEMORY ISOLATION OF A NIC [33] AND A USB 
DMAC WITH AND WITHOUT THE FRAMEWORK. THE HOL4 EXPERIENCE 
BEFORE THE NIC DMAC WORK WAS ABOUT FOUR MAN-MONTHS, AND 
ABOUT 30 MAN-MONTHS BEFORE THE USB DMAC WORK. 

Aspect            | NIC DMAC w/o fw | USB DMAC w/ fw 
LoC model         | 1500            | 2000 
LoC verification  | 55000           | 2000 
Modeling time     | 3 person-months | 2 person-months 
Verification time | 9 person-months | 1/2 person-month 

The following is a description of II.cf, and why II.cf, A.f 
and II.fas satisfy DMAC proof obligation 1). II.cf(m, s) finds 
the unfetched BDs in four steps. (1) It retrieves the address of 
the current SOP BD of the current DMA packet in transfer. (2) 
The address of the next BD to fetch is obtained from hp. (3) If 
hp is zero, then the entire pending queue has been visited and 
the function returns the accumulated BDs so far. Otherwise, it 
collects the unfetched BDs of the current DMA packet, starting 
from the next BD to fetch and traversing the next descriptor 
pointer fields. (4) The next BD to fetch is the SOP BD of the 
next DMA packet, identified by reading the LRAM entry of 
the last visited SOP BD. The procedure continues with step 
3. The first BD that is fetched by A.f is at the address given 
by II.fas obtained from hp, which is the address obtained by 
II.cf in step 2. Hence, the fetched BD is the first pending BD. 
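
The four-step traversal described above can be pictured as a walk over two linked structures: the next-descriptor pointers inside a DMA packet and the LRAM links between SOP BDs. The following Python sketch is only an illustration under stated assumptions, not the HOL4 definition: the function name `unfetched_bds`, the dictionary encoding of BDRAM and LRAM, and the use of integer addresses with 0 as the null pointer are all hypothetical.

```python
def unfetched_bds(current_sop, hp, next_desc, lram):
    """Collect the addresses of BDs that have not been fetched yet.

    Hypothetical encoding (an assumption, not the HOL4 model):
      - addresses are ints and 0 is the null pointer;
      - next_desc[a] is the next-descriptor pointer of the BD at address a;
      - lram[a] is the LRAM link from the SOP BD at a to the next SOP BD;
      - current_sop is the SOP BD of the packet that hp points into;
      - both linked structures are acyclic.
    """
    collected = []
    sop, nxt = current_sop, hp      # steps (1) and (2)
    while nxt != 0:                 # step (3): zero means queue exhausted
        bd = nxt
        while bd != 0:              # walk the remaining BDs of this packet
            collected.append(bd)
            bd = next_desc[bd]
        nxt = lram[sop]             # step (4): SOP BD of the next packet
        sop = nxt
    return collected


# Two packets: 10 -> 11 -> 12 and 20 -> 21; BD 10 has already been fetched.
NEXT_DESC = {10: 11, 11: 12, 12: 0, 20: 21, 21: 0}
LRAM = {10: 20, 20: 0}
PENDING = unfetched_bds(10, 11, NEXT_DESC, LRAM)
```

In this toy queue the first collected address equals hp, which mirrors the argument that the BD fetched next is the first pending BD.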


VI. APPLICATION AND EVALUATION 


The framework consists of about 28000 lines of HOL4 
code, including models and proofs. It was first described in 
pseudocode based on reviews of more than 20 DMACs, and 
then refined into HOL4 code [42]. The high-level design, 
definition, and proof took in total 18 person-months. 

The instantiation of the USB DMAC consists of about 2000 
lines for the model, and about 2000 lines for the proofs of the 
proof obligations. The model is based on the informal specification 
[32], which, as is common with informal specifications, 
contains undefined terms whose meaning must be derived from 
the (lacking) context, dispersed information, and typos. As with 
the framework, we started with high-level pseudocode that 
was gradually refined to remove ambiguities and to make 
it fit the framework, requiring seven person-weeks. Verifying 
the proof obligations took about two additional person-weeks. 
In previous work [33], we modeled and verified 
memory isolation of a NIC DMAC without the support of 
the framework, taking about three months of modeling and 
nine months of proving that the invariant is preserved. Because 
the USB DMAC took significantly less time with the framework, 
we believe that the framework provides significant assistance in 
verifying memory isolation of DMACs, with the main benefit being 
the proof that the invariant is preserved. Table III compares 
the efforts invested into verifying the 
NIC and USB DMACs with and without the framework. 

The benefit of our approach is that we can establish soundness 
of the verification conditions independently of the driver. 
One can then analyze the driver independently. For instance, 
the Linux driver of the USB DMAC uses only a limited 
set of the features of the device: It allocates one single BD 
per channel, meaning that the DMA packets consist of only 
one buffer, and it enqueues a new packet only after the 
previous one has been completed. The driver allocates two 
memory regions for BDRAM and LRAM. These memory 
regions do not overlap, neither do the BDs, with each BD 
of each channel being allocated a fixed location. The Linux 
virtual memory manager allocates the BDRAM and LRAM 
regions, and likewise the DMA buffers for data transfers. 
Assuming that these memory regions are disjoint and located 
in “readable” and “writable” memory, this driver satisfies 
the two driver proof obligations as follows. First, the driver 
pops BDs from the completion queues by reading the C 
register, before reinitializing them and appending them by 
writing the Q register, thus only modifying the pending BD 
queues by appending BDs. LRAM is not accessed by the 
driver. Moreover, by assumption, BDs and DMA buffers are in 
readable and writable memory. The driver organizes the BDs 
in disjoint array slots in BDRAM, meaning that BDs do not 
overlap, and thus write back addresses do not coincide with 
read addresses of other BDs. 


VII. RELATED WORK 


Verification of Device Drivers without DMA. Model checkers 
and interactive theorem provers have been used to verify 
various properties of drivers controlling devices without a 
DMAC: Reading from flash memory gives previously written 
data [34]; correct copying of data from memory to an ATAPI 
disk [35]; termination of a UART driver transferring data 
from memory to the external environment [36]; safety and 
liveness properties of a UART driver [37]; absence of data 
races and illegal memory accesses by a keyboard driver [38]; 
and equivalence between abstract and concrete models of an 
SPI driver and the SPI controller [39]. 

These devices do not have a DMAC, meaning that their 
memory isolation depends only on the memory accesses per- 
formed by the driver. For devices without a DMAC, methods 
have been investigated for synthesizing and (semi-) automat- 
ically generating device drivers that satisfy the interfaces of 
the OS and the I/O device [50], [51]. 

Hardware Verification. Our work assumes that the 
hardware implementation of the device satisfies its hardware-software 
interface. Hardware verification is indeed an orthogonal 
problem to the driver verification problem. 

A DMAC is reminiscent of a CPU in the sense that BDs 
correspond to instructions, BD operations correspond to an 
instruction pipeline, and concurrent DMA channels correspond 
to multiple instruction streams (threads) with BDs from dif- 
ferent channels. These aspects have been investigated by the 
CPU formal verification community [52]-[54]. 

Specifically for DMAC implementations, Clarke et al. [40] 
have used model checking to verify that DMAC transfers 
are eventually completed, that the DMAC is eventually ready 
for new transfers, and that memory operations terminate. The 
analyzed DMAC is relatively simple: The DMAC maintains 
no queues nor multiple channels; its configuration depends 
only on the DMAC registers; and the next transfer can be 
programmed only after the previous transfer is finished. The 
same DMAC design was later used to verify relationships 
between signals, including clock cycle delays [41]. 

Verification of DMAC Drivers. Monniaux [43] has verified 
a USB driver that controls a DMAC, using a static C code 
analyzer designed to detect memory access and arithmetic 
errors. The driver and the device are modeled in C, with 
interleaved execution. The C analyzer can automatically verify 
that the driver and the controller access only allowed memory. 

Even if an existing C analyzer largely automates veri- 
fication, the framework addresses some of the limitations 
of this work. First, to automate the analysis, the C model 
coarsely overapproximates all possible device actions. In order 
to check soundness of this overapproximation, one should 
refine the model and prove some sort of refinement (see Sub- 
section IV-C), which can be difficult in C and is not supported 
by the tool. Second, the use of a general C verification tool 
requires the model to be defined in terms of C semantics. 
For example, the tool is designed for 32-bit atomic variable 
accesses, but some devices may use single byte granularity. 
Third, it is not clear if the tool can analyze models of DMACs 
that have complex BD queues. In fact, the analyzed model 
has a relatively simple structure, where BD queues consist 
of three static arrays. Finally, the overapproximation used to 
automate the analysis may prevent it from being used to verify 
functional properties (e.g., a buffer is actually copied from 
source to destination), which the tool has no support for. 

Donaldson et al. [46] have used model checking to verify 
absence of data races to DMA buffers between the PPE (a 
general CPU) and SPEs (HW accelerators) of the IBM Cell 
BE processor, which have embedded DMACs in the SPEs to 
transfer data between main memory and their local memory. 
In their analysis, BD queues are not considered, only single 
atomic DMA commands. Hence, this work is limited to this 
specific hardware and does not consider memory isolation. 

Schwarz et al. [47] have used Coq to model a DMAC and 
a hypervisor, which virtualizes the DMAC among two guests, 
and verified that the DMAC virtualization keeps the guests 
isolated. This work also concerns a specific and simple DMAC, 
not dealing with complex organizations of BD queues. 

In previous work [33] we modeled a DMAC of an Ethernet 
NIC in HOL4 and verified sufficient conditions for isolating 
packets in transfer. The BD queues are organized as linked 
lists stored in internal DMAC memory. The formalization and 
verification took about one person-year, the majority of which 
can be saved with the DMAC framework. 

Techniques for Isolating DMACs. The ability to isolate 
DMA accesses is fundamental for guaranteeing security of 
entire systems. For instance, the security of several verified 
systems [23]-[30], [48], [49] requires restricted DMA. 

Hardware assisted DMAC isolation uses stand-alone IOM- 
MUs [15] or IOMMU embedded in the DMAC [8] to prevent 
the DMAC from accessing critical memory due to untrusted 
configurations. In the absence of dedicated hardware mechanisms, 
the common approach to enforce memory isolation is via 
a monitor in the OS [44], [55] or the hypervisor [45], that 
intercepts driver reconfigurations of the DMAC. Other methods 
analyze an aspect of the system at runtime and react 
to violations: Execution of device firmware follows a pre- 
determined pattern [14] (e.g. the stack pointer and program 
counters are in valid memory regions), memory bus activ- 
ity follows a pre-determined pattern [13], execution traces 
recorded by hardware or binary instrumentation [12], and 
integrity of firmware and I/O configuration (the checks of 
which are triggered by interrupts and thresholds of hardware 
performance counters) [11]. 

Grisafi et al. [56] present a mechanism to isolate memory 
for low-end embedded systems with DMACs. This is 
achieved by means of a hypervisor, and a compiler that inserts 
hypervisor calls in applications accessing DMAC registers. 
The software design has been verified; however, the security 
of the system depends on the fact that the security policies 
enforced by the hypervisor prevent the DMAC from accessing 
critical regions of memory. While this is simple to check for 
simple DMACs with single BDs that are configured only 
via memory mapped registers, guaranteeing this property for 
complex DMACs requires analyzing the device model. Our 
work is complementary to the software verification, since it 
supports the identification and verification of the security 
policies for the devices. 


VIII. CONCLUSION 


We have implemented a framework in the interactive the- 
orem prover HOL4 for modeling DMACs, and by means 
of refinement formally verified DMAC memory isolation. 
Comparing the efforts of the USB DMAC instantiation with 
the previous verification of memory isolation of a NIC DMAC 
[33] strongly suggests that the framework can significantly 
reduce the cost of verification of isolation (i.e., proving that 
the invariant is preserved). 

Our verification can be extended in two directions. Towards 
software, the proof obligations can be used to check that 
device drivers securely configure DMACs or to synthesize 
security monitors, and the abstract model can be used to check 
functional correctness (e.g., transmission of network packets). 
Towards hardware, the model M3 can be used to either show 
that a formal hardware design respects the specification, or for 
model-driven testing of closed-source hardware. 

We plan to implement and model a monitor that runs 
underneath the Linux USB DMAC driver for the USB DMAC 
on BeagleBone Black [7], [32], checking that the driver 
reconfigurations are secure; and then verify that the monitor 
satisfies the proof obligations. This fulfills two goals: The 
monitor preserves security even if the driver is buggy, and 
the monitor itself can be used to detect if the Linux driver has 
memory isolation bugs. 

Abstract—Program analyses based on Instruction Set Architec- 
ture (ISA) abstractions can be circumvented using microarchitec- 
tural vulnerabilities, permitting unwanted program information 
flows even when proven ISA-level properties ostensibly rule them 
out. However, the low abstraction levels found below ISAs, e.g., 
in microarchitectures defined in hardware description languages, 
may obscure information flow and hinder analysis tool develop- 
ment. We present a machine-checked formalization in the HOL4 
theorem prover of a language, MIL, that abstractly describes 
microarchitectural in-order and out-of-order program execution 
and enables reasoning about low-level program information flows. 
In particular, MIL programs can exhibit information flow side 
channels when executed out-of-order, as compared to a reference 
in-order execution. We prove memory consistency between MIL’s 
out-of-order and in-order dynamic semantics in HOL4, and 
define a notion of conditional noninterference for MIL programs 
which rules out trace-driven cache side channels. We then 
demonstrate how to establish conditional noninterference for pro- 
grams via a novel semi-automated bisimulation based verification 
strategy inside HOL4 that we apply to several examples. Based 
on our results, we believe MIL is suitable as a translation target 
for ISA code to enable information flow analyses. 

Index Terms—information flow, interactive theorem proving, 
HOL4, microarchitectures, out-of-order execution 


I. INTRODUCTION 


Vulnerabilities such as Spectre, Meltdown, and Fore- 
shadow [1]-[3] demonstrate that program analyses based on 
Instruction Set Architecture (ISA) abstractions cannot guar- 
antee important program properties such as freedom from 
unwanted information flows. Consequently, microarchitectures 
(residing below the ISA level) are important to understand and 
take into account by developers of compilers and program 
analysis tools. However, the low abstraction level of most 
hardware description languages (HDLs) obscures important 
microarchitectural features such as out-of-order execution of 
program instructions. In particular, HDLs complicate reason- 
ing about low-level program information flows. 

To address this problem, Guanciale et al. [4] proposed the 
Machine Independent Language (MIL), which abstractly de- 
scribes microarchitectures and permits analysis of information 
flows between microinstructions. In this paper, we present a 
deep embedding of MIL and an encoding of its out-of-order 
(OoO) and in-order (IO) dynamic semantics in the HOL4 
theorem prover [5]. Using our embedding, we formalize two 
key aspects of the metatheory of MIL. Firstly, we provide, 
to our knowledge, the first general machine-checked proof 
of memory consistency between in-order and out-of-order 
execution of microinstructions. Secondly, we define a notion 
of conditional noninterference (CNI) capturing trace-driven 
cache based information flow [6]. To achieve this, we clarify 
the assumptions under which MIL programs (1) do not go 
wrong during runtime, and (2) progress as expected, which 
were previously left implicit. 

This work has been partially supported by the KTH CERCES Center and 
the Trustfull project funded by the Swedish Foundation for Strategic Research. 

https://doi.org/10.34727/2022/isbn.978-3-85448-053-2_19 

We show that out-of-order execution can introduce infor- 
mation side channels, by exhibiting a violation of conditional 
noninterference. We then devise a semi-automated bisimu- 
lation based strategy to verify conditional noninterference, 
which we apply to several example MIL programs. To im- 
prove automation of conditional noninterference proofs, we 
developed functions and results for verified execution of MIL 
instructions inside HOL4 [7]. We also refined our functions 
to CakeML code [8], which, when compiled to native code, 
can execute instructions orders of magnitude faster than HOL4 
and demonstrate side channels for concrete MIL programs. 

In order to make our theory and tools applicable to a 
range of real-world ISAs such as ARMv8-A and RISC-V, we 
developed a translator from BIR, an architecture independent 
binary code representation from the HolBA binary analysis 
framework [9] that has proof-producing lifters. To validate the 
MIL formalization, we analyzed both hand-crafted programs 
and programs translated from BIR. Based on our results, 
we believe MIL is ready to be used as a form of abstract 
microcode language, e.g., as a target language for ISA instruc- 
tions to enable low-level information flow analysis. From the 
hardware perspective, our memory consistency proof for MIL 
can be reused across different formalized microarchitectures. 


In summary, we make the following contributions: 


• Foundations: We define MIL and its dynamic OoO 
and IO semantics in HOL4, including notions of 
well-formedness and resource initialization for runtime states. 

• Metatheory: We develop formal metatheory of MIL in 
HOL4, including a proof of memory consistency and a 
notion of conditional noninterference for the semantics. 

• Tools: We verify functions for executing MIL programs 
and then refine them to CakeML, yielding trustworthy 
MIL analysis tools both inside and outside HOL4. 

• Applications: We devise a semi-automated reasoning 
strategy for conditional noninterference, which we apply 
to verify confidentiality of several MIL programs. 

Fig. 1. Comparison of abstraction levels of BIR and MIL. 


As supplementary material for the paper [10], we provide 
the HOL4 definitions and proofs, Standard ML code, CakeML 
programs, and a technical report that renders key MIL defini- 
tions and results into readable mathematical vernacular. 


II. BACKGROUND 
A. Instruction pipelining and OoO execution 


Pipelined processors divide instruction execution into 
stages, such as fetching and decoding, which are carried out 
by distinct processing units working independently. However, 
a programmer or compiler developer typically assumes instruc- 
tions are processed and completed in sequential order, which 
may cause pipelines to stall while a processing unit is waiting 
for instructions to complete the previous stage. By extracting 
data and address dependencies between instructions, microar- 
chitectures can reorder instructions, leading to better pipeline 
utilization and performance [11], [12]. 

Instruction reordering, and thus OoO execution, is a funda- 
mental microarchitectural mechanism that can be leveraged in 
isolation to increase performance of pipelined processors [13]. 
It is also a prerequisite of speculative execution, where instruc- 
tions are fed into a pipeline even when they are not known to 
be necessary to execute. Our formalization of the foundations 
of OoO execution is therefore an important building block 
towards machine-checked analysis of speculation in MIL using 
the speculative MIL semantics by Guanciale et al. [4]. 


B. HolBA and BIR 


HolBA is a binary analysis platform based on HOL4 with 
support for ISAs such as ARMv8-A and RISC-V [9]. HolBA 
provides proof-producing transformations of binaries to an 
intermediate HOL4 representation, called BIR. That is, HolBA 
generates a HOL4 theorem that the BIR representation of an 
input binary preserves its behavior, as given by a formalization 
of the corresponding ISA [14]. BIR is also the target language 
of Scam-V, a toolchain which finds discrepancies between ab- 
stract information side-channel models and real microarchitec- 
tures [15]. Figure 1 illustrates the intended abstraction levels of 
BIR and MIL compared to some real-world counterparts [16], 
[17]. However, MIL elides many microarchitectural features 
not relevant to information flow. 


III. SYNTAX AND SEMANTICS OF MIL 


In this section, we present the syntax of MIL and its OoO 
and IO dynamic semantics. The presentation largely follows 
Guanciale et al. [4], but we highlight key differences and 
additions due to the formalization in HOL4. Informally, MIL 
is a static single assignment (SSA) language [18], where 
variables in an assignment are unique microinstruction names. 
Ultimately, a MIL program, if it terminates successfully, 
computes a set of assignments of 64-bit values to such names. 
We assume that names are totally ordered, which induces an 
order on instructions via their assigned names that we call the 
program order. A program can thus be given as a linear list 
of guarded assignments to variables. 

Example 1: We use the small parameterized MIL program 
below as a running example. The program compares the con- 
tent of the register reg to 1 and sets the program counter (PC) 
to the memory address adr if this is the case, or increments 
the current PC value by 4 otherwise. It thus implements a 
high-level conditional branch on equal (beq) instruction. 


tb0 := true ? 0; // zeroed name for PC load/store 
tb1 := true ? reg; // get register identifier 
tb2 := true ? load(REG, tb1); // load register value 
tb3 := true ? tb2 == 1; // is the register value 1? 
tb4 := true ? load(PC, tb0); // load PC value 
tb5 := true ? adr; // get memory address 
tb6 := tb3 ? store(PC, tb0, tb5); // store to PC 
tb7 := true ? tb4 + 4; // increment PC value by 4 
tb8 := !tb3 ? store(PC, tb0, tb7); // store to PC 


To obtain a fully defined ("ground") MIL program, the assignment 
variable names (tbX) must be replaced by non-negative 
integers, and the parameters reg and adr must be replaced by 
64-bit words. We usually use variable name suffixes to indicate 
desired integer ordering, e.g., tb0 < tb6. Subsequently, we 
will omit true guards, e.g., we will write tb0 := 0. 
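
As a rough illustration of grounding, the Python sketch below instantiates the running example as a list of guarded assignments. The tuple encoding `(name, guard, operation)`, the tags such as `"ld"` and `"st"`, and the function `ground_beq` are hypothetical choices made for this sketch; they are not the HOL4 datatypes of the formalization.

```python
MASK64 = (1 << 64) - 1
TRUE_GUARD = ("val", 1)  # assumption: the constant true is encoded as 1


def name(t):
    """Expression that refers to the value assigned to name t."""
    return ("name", t)


def ground_beq(base, reg, adr):
    """Ground the beq example: tb0..tb8 become base+0..base+8."""
    if base < 0:
        raise ValueError("names must be non-negative integers")
    if not (0 <= reg <= MASK64 and 0 <= adr <= MASK64):
        raise ValueError("reg and adr must be 64-bit words")
    tb = [base + k for k in range(9)]
    return [
        (tb[0], TRUE_GUARD, ("exp", ("val", 0))),
        (tb[1], TRUE_GUARD, ("exp", ("val", reg))),
        (tb[2], TRUE_GUARD, ("ld", "R", tb[1])),
        (tb[3], TRUE_GUARD, ("exp", ("eq", name(tb[2]), ("val", 1)))),
        (tb[4], TRUE_GUARD, ("ld", "PC", tb[0])),
        (tb[5], TRUE_GUARD, ("exp", ("val", adr))),
        (tb[6], name(tb[3]), ("st", "PC", tb[0], tb[5])),
        (tb[7], TRUE_GUARD, ("exp", ("add", name(tb[4]), ("val", 4)))),
        (tb[8], ("not", name(tb[3])), ("st", "PC", tb[0], tb[7])),
    ]


BEQ = ground_beq(10, 5, 0x4000)
BEQ_NAMES = [instr[0] for instr in BEQ]
```

Choosing consecutive integers from a base is only one way to respect the intended ordering; any strictly increasing assignment of names would give the same program order.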


A. Abstract Syntax 


In Figure 2(a), we define the abstract syntax of MIL. 
Names. We use unbounded HOL4 natural numbers as mi- 
croinstruction names t, and predicate sets [19] for collections 
of names N. This approach theoretically permits infinite 
sets which are not meaningful in our context, but allowed 
easy transcription of set-related definitions from the original 
definition of MIL. 

Values. Values v (and a) are 64-bit words encoded in the 
usual way for HOL [20]. The constant values false, true, and 
0 are defined according to conventions of the C language. 
Besides finiteness and distinctness of false and true, the MIL 
metatheory (in contrast to the tools and examples) does not 
rely on anything specific about the word size. 

Expressions. Expressions e (and c) are side-effect free and 
are assumed to include at least names and values. However, as 
long as requirements on the semantics of expressions (outlined 
in Section III-B) are met, expressions can be arbitrarily added 
to MIL without affecting the metatheory. In our HOL4 encod- 
ing, we defined expression syntax and semantics to match the 
BIR language, streamlining the translation from BIR to MIL. 
Resources and operations. MIL operations are defined on a 
resource τ, which is either the PC, memory, or a register. An 
operation o is either an expression, or a load or store on a 
resource. Since there is a single PC resource, PC loads and 
stores are intended to take a name as first argument that is 
assigned to the value 0; this is implicit for Guanciale et al. 
Microinstructions. A MIL microinstruction ι, or instruction 
for short, is an assignment of a name t to (the result of) an 
operation o, guarded by an expression c. A single higher-level 
instruction, e.g., at the ISA level, will typically be represented 
by many MIL instructions, which is why MIL is parameterized 
on a translation function explained in Section III-C. 


B. Runtime States and Semantic Definitions 


To provide dynamic semantics for MIL, we define runtime 
states for programs; Figure 2(b) lists the basic syntax we use. 
Programs. MIL programs I are predicate sets of instructions. 
Whenever convenient, we consider instructions in I in program 
order (using the assigned instruction names). 

Stores. Stores s are finite maps from names to values, where 
dom(s) is the set of names that are mapped by s. We write 
s(t)↓ (s(t)↑) for t ∈ dom(s) (resp. t ∉ dom(s)). 

States. In addition to a program I and store s, a MIL runtime 
state σ contains two sets of names C and F that respectively 
track whether an associated instruction has been committed to 
memory or its successor instructions have been fetched. 
Observations. An observation obs is either the silent observation 
ε, a data load dl, a data store ds, or an instruction load 
il. The three latter include a memory address value. 
Actions. Actions α represent transitions. Instructions are first 
executed (EXE). Then, if an instruction is a memory store, it 
can be committed (CMT), or, if it is a PC store, it can cause 
the next instructions to be fetched (FTC). 

Labels. In contrast to Guanciale et al., transition labels l 
contain not only observations, but also the action performed 
by the transition and the name of the instruction for which the 
action was performed. 

In abstract syntax, the program in Example 1, which we 
abbreviate I_beq(reg, adr), is written: 

tb0 ← 0, tb1 ← reg, tb2 ← ld R tb1, tb3 ← tb2 == 1, 
tb4 ← ld PC tb0, tb5 ← adr, tb6 ← tb3 ? st PC tb0 tb5, 
tb7 ← tb4 + 4, tb8 ← !tb3 ? st PC tb0 tb7 

Executing the last instruction in I_beq is represented by a label 
(il (pc0 + 4), FTC(I), tb8), where pc0 is the original PC value 
and I is the translation of the program at pc0 + 4. 

Bound and free names. For an expression e, its set of names 
n(e) is defined recursively on the structure in the obvious 
way. An instruction ι has a bound name, written bn(ι), and 
a set of free names, written fn(ι); the set of all names in ι is 
written n(ι). The set of all bound names of instructions in a 
program I is written bn(I). In addition, n(l) yields the name 
in the label l. For example, if ι = tb6 ← tb3 ? st PC tb0 tb5, then 
we have bn(ι) = tb6 and fn(ι) = n(tb3) ∪ n(st PC tb0 tb5) = 
{tb0, tb3, tb5}, so n(ι) = {tb0, tb3, tb5, tb6}. 
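
The name functions can be pictured with a small Python sketch. It assumes a hypothetical tuple encoding in which an instruction is `(name, guard, operation)` and expressions are tagged tuples; the function names `bn`, `fn`, and `n` follow the text, but the code is an illustration and not the HOL4 definition.

```python
def expr_names(e):
    """Names occurring in an expression, by recursion on its structure."""
    if e[0] == "val":
        return set()
    if e[0] == "name":
        return {e[1]}
    found = set()
    for sub in e[1:]:          # operator node: recurse into the operands
        found |= expr_names(sub)
    return found


def bn(instr):
    """Bound name: the name the instruction assigns."""
    return instr[0]


def fn(instr):
    """Free names: names in the guard plus names used by the operation."""
    _, guard, op = instr
    if op[0] == "exp":
        op_names = expr_names(op[1])
    else:                      # ("ld", res, t) or ("st", res, t1, t2)
        op_names = set(op[2:])
    return expr_names(guard) | op_names


def n(instr):
    """All names of an instruction."""
    return fn(instr) | {bn(instr)}


# tb6 <- tb3 ? st PC tb0 tb5, with tbX encoded as the integer X
TB6 = (6, ("name", 3), ("st", "PC", 0, 5))
```

With this encoding, `fn(TB6)` and `n(TB6)` reproduce the sets computed in the example above.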

Semantics of expressions. The semantics of an expression 
e in store s is given by a partial function returning a value 
v, which we write [e]s = v. If the function is (un-)defined, 
we write [e]s↓ (resp. [e]s↑). We do not define an explicit 
canonical function for the semantics of expressions, since 
it is microarchitecture dependent. However, in contrast to 
Guanciale et al., we impose requirements on such functions: 

1) [e]s↓ if and only if n(e) ⊆ dom(s). 
2) If s(t) = s'(t) holds for all t ∈ n(e), then [e]s = [e]s'. 
3) For all v and s, [v]s = v. 

For validation, we implemented a function consistent with 
BIR semantics, where for example e + e' is evaluated using 
word_add from the HOL4 word theory. Given a store s, an 
expression c evaluates to a true guard condition, written [c]s, 
whenever there exists v such that [c]s = v and v ≠ false. 
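
To make the three requirements concrete, here is a minimal Python sketch of one possible expression semantics. It is an assumption-laden toy, not the BIR-consistent HOL4 function: expressions are hypothetical tagged tuples, `None` stands for "undefined", and only addition, equality, and negation are supported.

```python
MASK64 = (1 << 64) - 1
FALSE, TRUE = 0, 1  # assumption: C-style encoding of the constants


def eval_expr(e, s):
    """Partial semantics [e]s; returns None when the result is undefined."""
    tag = e[0]
    if tag == "val":
        return e[1] & MASK64           # requirement 3 for 64-bit values
    if tag == "name":
        return s.get(e[1])             # undefined if the name is not in s
    args = [eval_expr(sub, s) for sub in e[1:]]
    if any(a is None for a in args):
        return None                    # requirement 1: needs all names
    if tag == "add":
        return (args[0] + args[1]) & MASK64   # wrap-around, 64-bit add
    if tag == "eq":
        return TRUE if args[0] == args[1] else FALSE
    if tag == "not":
        return TRUE if args[0] == FALSE else FALSE
    raise ValueError("unsupported expression tag: " + tag)


def guard_holds(c, s):
    """[c]s as a guard: defined and different from false."""
    v = eval_expr(c, s)
    return v is not None and v != FALSE


IS_ONE = ("eq", ("name", 2), ("val", 1))
```

Because the evaluator only reads the names that occur in the expression, two stores that agree on those names give the same result, which is the content of requirement 2.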
Address and resource of store or load. Given the name t 
of a store or load instruction in a program, we need to be 
able to obtain the resource and the name of the instruction 
that computes the address that t targets. We therefore define 
the partial function addr so that addr(I, t) = (τ, t') if 
t ← c?ld τ t' ∈ I or t ← c?st τ t' t'' ∈ I. For instance, 
addr(I_beq, tb2) = (R, tb1) for the example program. 
Store may and store active. To handle store-to-load dependencies 
we define two auxiliary functions str-may(σ, t) and 
str-act(σ, t) that determine, for a given load instruction t and 
state σ, respectively, a) the set of stores ι = t' ← c'?st τ t1 t2 
that may, by further instantiation of names smaller than t, assign 
to the (possibly as yet unknown) load address t0 of t, and b) 
the set of stores ι in str-may(σ, t) that cannot be eliminated 
due to another store t'' : t' < t'' < t overwriting either the 
store address t1 of t' or the load address t0 of t. Formally: 

\[
\begin{aligned}
\iota \in \textit{str-may}(\sigma, t) \iff{} & t' < t \land ([c']s \lor [c']s{\uparrow}) \land {} \\
 & (s(t_1) = s(t_0) \lor s(t_1){\uparrow} \lor s(t_0){\uparrow}) \\
\iota \in \textit{str-act}(\sigma, t) \iff{} & \iota \in \textit{str-may}(\sigma, t) \land {} \\
 & \forall\, t'' \leftarrow c''?\mathit{st}\ \tau\ t_3\ t_4 \in \textit{str-may}(\sigma, t).\ t'' > t' \land [c'']s \implies {} \\
 & \quad s(t_3) \neq s(t_0) \land s(t_3) \neq s(t_1)
\end{aligned}
\]

Example 2: The MIL program below loads the register r1 
from the memory address b1, copies the value of r1 into r2 if 
the flag in register z is set, saves the result into the memory 
address b2, and then increments the PC by 4. At a high level, 
the program thus implements conditional copying of memory 
on equal, and we refer to it as I_ceq(b1, b2). 


tc00 := 0; tc01 := r1; tc02 := r2; 
tc03 := z; tc04 := b1; tc05 := b2; 
tc11 := load(MEM, tc04); // [1of2] r1 := *b1 
tc12 := store(REG, tc01, tc11); // [2of2] 
tc21 := load(REG, tc03); // [1of3] cmov z, r2, r1 
tc22 := tc21 == 1 ? load(REG, tc01); // [2of3] 
tc23 := tc21 == 1 ? store(REG, tc02, tc22); // [3of3] 
tc31 := load(REG, tc02); // [1of2] *b2 := r2 
tc32 := store(MEM, tc05, tc31); // [2of2] 
tc41 := load(PC, tc00); // [1of3] pc := pc + 4 
tc42 := tc41 + 4; // [2of3] 
tc43 := store(PC, tc00, tc42); // [3of3] 


We assume that I_ceq(b1, b2) runs after another initialization 
program I_0, i.e., that σ = (I_0 ∪ I_ceq(b1, b2), s, C, F) and all 
instruction names in I_0 are before tc00. 

Suppose that in the state σ, we have s(tc00)↑, ..., s(tc43)↑. 
Then, str-may(σ, tc31) contains all register stores coming 
before tc31, since the load address of tc31 is undefined and 
any previous register store instruction can potentially affect 
the loaded value of tc31. 

N, C, F ::= {t1, t2, ...}                  names (set) 
v, a    ::= false | true | 0 | ...         value (word64) 
e, c    ::= v | t | -e | e + e' | ...      expression 
τ       ::= PC | R | M                     resource 
o       ::= e | ld τ t | st τ t t          operation 
ι       ::= t ← c?o                        instruction 

(a) 

I   ::= {ι1, ι2, ...}                      program (set) 
s   ::= [t1 ↦ v1, t2 ↦ v2, ...]            store (fmap) 
σ   ::= (I, s, C, F)                       state 
obs ::= ε | dl a | ds a | il a             observation 
α   ::= EXE | CMT(a, v) | FTC(I)           action 
l   ::= (obs, α, t)                        transition label 

(b) 

Fig. 2. MIL abstract syntax (a), and syntax used for MIL runtime state and executions (b). 


Suppose σ is the state after the execution of all instructions 
in I_0 and the instructions on the first line, i.e., s(tc00) = 
0, ..., s(tc05) = b2. Then, str-may(σ, tc31) contains all stores 
in I_0 that update r2, as well as the store tc23. str-may(σ, tc31) 
does not contain tc12, since the destination register of tc12 (r1) 
differs from the source register of tc31 (r2). 

Suppose σ is the state after the execution of all instructions 
until tc22, and let s(tc21) = 0. Then, str-may(σ, tc31) does 
not contain tc23, since the guard condition of tc23 is false, and 
therefore the store will not be executed. Hence, str-act(σ, tc31) 
contains the last instruction in str-may(σ, tc31) not overwritten 
by a subsequent store. However, if s(tc21) = 1, then 
str-may(σ, tc31) contains tc23 and str-act(σ, tc31) = {tc23}. 
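
The scenarios of Example 2 can be replayed with a small executable sketch of str-may and str-act. Everything in the encoding is an assumption made for illustration: instructions are hypothetical `(name, guard, operation)` tuples, names tcXY are encoded as 100 + XY, the initialization program `I0` is invented, and a disequality such as s(t3) ≠ s(t0) is read as "not both defined and equal".

```python
FALSE, T = 0, ("val", 1)   # T: the constant-true guard (assumed encoding)

def ev(e, s):
    """Partial guard evaluation; None stands for 'undefined'."""
    if e[0] == "val":
        return e[1]
    if e[0] == "name":
        return s.get(e[1])
    if e[0] == "eq":
        a, b = ev(e[1], s), ev(e[2], s)
        return None if a is None or b is None else int(a == b)
    raise ValueError("unsupported expression: " + e[0])

def guard_true(c, s):
    v = ev(c, s)
    return v is not None and v != FALSE

def same(s, x, y):
    """Both names are defined in s and map to equal values."""
    return x in s and y in s and s[x] == s[y]

def str_may(prog, s, t):
    by_name = {i[0]: i for i in prog}
    _, res, t0 = by_name[t][2]             # t <- c ? ld res t0
    may = set()
    for (t1, c1, op) in prog:
        if op[0] != "st" or op[1] != res or t1 >= t:
            continue
        guard_ok = guard_true(c1, s) or ev(c1, s) is None
        addr_ok = same(s, op[2], t0) or op[2] not in s or t0 not in s
        if guard_ok and addr_ok:
            may.add(t1)
    return may

def str_act(prog, s, t):
    by_name = {i[0]: i for i in prog}
    t0 = by_name[t][2][2]
    may = str_may(prog, s, t)
    act = set()
    for t1 in may:
        a1 = by_name[t1][2][2]             # store address name of t1
        overwritten = any(
            t2 > t1 and guard_true(by_name[t2][1], s)
            and (same(s, by_name[t2][2][2], t0)
                 or same(s, by_name[t2][2][2], a1))
            for t2 in may)
        if not overwritten:
            act.add(t1)
    return act

R1, R2, Z, B1, B2 = 1, 2, 3, 0x1000, 0x2000
FLAG = ("eq", ("name", 121), ("val", 1))
I0 = [  # hypothetical initialization: r2 := 7; r1 := 7
    (1, T, ("exp", ("val", R2))), (2, T, ("exp", ("val", 7))),
    (3, T, ("st", "R", 1, 2)),
    (4, T, ("exp", ("val", R1))), (5, T, ("st", "R", 4, 2)),
]
ICEQ = [
    (100, T, ("exp", ("val", 0))), (101, T, ("exp", ("val", R1))),
    (102, T, ("exp", ("val", R2))), (103, T, ("exp", ("val", Z))),
    (104, T, ("exp", ("val", B1))), (105, T, ("exp", ("val", B2))),
    (111, T, ("ld", "M", 104)), (112, T, ("st", "R", 101, 111)),
    (121, T, ("ld", "R", 103)), (122, FLAG, ("ld", "R", 101)),
    (123, FLAG, ("st", "R", 102, 122)),
    (131, T, ("ld", "R", 102)), (132, T, ("st", "M", 105, 131)),
    (141, T, ("ld", "PC", 100)),
    (142, T, ("exp", ("add", ("name", 141), ("val", 4)))),
    (143, T, ("st", "PC", 100, 142)),
]
PROG = I0 + ICEQ
S_INIT = {1: R2, 2: 7, 3: 7, 4: R1, 5: 7}            # only I0 executed
S_LINE1 = {**S_INIT, 100: 0, 101: R1, 102: R2, 103: Z, 104: B1, 105: B2}
S_FLAG0 = {**S_LINE1, 111: 9, 112: 9, 121: 0}        # flag z not set
S_FLAG1 = {**S_LINE1, 111: 9, 112: 9, 121: 1, 122: 9}  # flag z set
```

Calling `str_may(PROG, S_LINE1, 131)` yields the I0 store to r2 together with tc23 (encoded as 123), and `str_act(PROG, S_FLAG1, 131)` yields only tc23, in line with the discussion above.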

Semantics of instructions. The semantics of instructions is
given by a partial function taking an instruction $\iota$ and state $\sigma$
and returning a value and an observation, which we write as
$[\iota]\sigma = (v, obs)$. We define the function by case analysis on $\iota$.

• $[t \leftarrow c?e]\sigma = (v, \epsilon)$, if $[e]s = v$.

• $[t \leftarrow c?\,\mathrm{ld}\ \tau\ t']\sigma = (v, dl\ a)$, if bn(str-act$(\sigma, t)$) $= \{t''\}$,
$s(t') = a$, $s(t'') = v$, $\tau = M$, and $t'' \in C$.

• $[t \leftarrow c?\,\mathrm{ld}\ \tau\ t']\sigma = (v, \epsilon)$, if bn(str-act$(\sigma, t)$) $= \{t''\}$,
$s(t') = a$, $s(t'') = v$, and either $\tau \neq M$ or $t'' \notin C$.

• $[t \leftarrow c?\,\mathrm{st}\ \tau\ t_1\ t_2]\sigma = (v, \epsilon)$, if $s(t_2) = v$ and $s(t_1)\downarrow$.

Completed microinstructions. To guarantee progress during
execution of MIL programs, we provide a different criterion
than Guanciale et al. for instructions to be completed. Specifically,
we define $\iota$ as completed in a state $\sigma = (I, s, C, F)$,
written $\mathcal{C}(\sigma, \iota)$, whenever

• $\iota = t \leftarrow c?\,\mathrm{st}\ M\ t_1\ t_2$ and either $[c]s = \mathit{false}$ or $t \in C$

• $\iota = t \leftarrow c?\,\mathrm{st}\ PC\ t_1\ t_2$ and either $[c]s = \mathit{false}$ or $t \in F$

• $\iota = t \leftarrow c?o$, and either $[c]s = \mathit{false}$ or $t \in \mathrm{dom}(s)$.

For example, if the value of reg is 1 in Example 1, then after
$t_{b3} \leftarrow t_{b2} == 1$ has been executed (mapping $t_{b3}$ to true),
instruction $t_{b6}$ becomes completed, since its guard is false.
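
The completion criterion above can be read as a small decision procedure. The following Python sketch is illustrative only and is not taken from the HOL4 formalization: the `Instr` record, the encoding of guards as functions over the store, and the instruction `tb6` with its guard are assumptions made for this example. It also assumes that the third case of the criterion is meant for operations other than memory and PC stores.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set

Store = Dict[str, int]  # finite map s from names to values


@dataclass(frozen=True)
class Instr:
    name: str                                 # bound name t
    guard: Callable[[Store], Optional[bool]]  # [c]s, or None when undefined
    op: str                                   # "e", "ld", or "st"
    resource: Optional[str] = None            # "PC", "R", or "M" for ld/st


def completed(i: Instr, s: Store, C: Set[str], F: Set[str]) -> bool:
    """Hypothetical rendering of the criterion for sigma = (I, s, C, F)."""
    if i.guard(s) is False:
        return True               # a false guard completes any instruction
    if i.op == "st" and i.resource == "M":
        return i.name in C        # memory stores must be committed
    if i.op == "st" and i.resource == "PC":
        return i.name in F        # PC stores must be fetched
    return i.name in s            # other instructions must be executed


def not_tb3(s: Store) -> Optional[bool]:
    # Assumed guard: true exactly when tb3 is stored as false (0).
    return None if "tb3" not in s else s["tb3"] == 0


tb6 = Instr("tb6", not_tb3, "e")
```

With this encoding, `completed(tb6, {"tb3": 1}, set(), set())` holds because the guard evaluates to false, whereas in the empty store the guard is undefined and the instruction is not yet completed.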


C. Transition Step Relations 


We define two dynamic semantics of MIL in the structural
operational semantics style: an OoO semantics and an IO
semantics. Specifically, we define, by the rules in Figure 3,
the labeled OoO transition step relation, $\sigma \xrightarrow{l} \sigma'$, and the
labeled IO transition step relation, $\sigma \xrightarrow{l}_{IO} \sigma'$.

OoO-Exe. This rule computes the value $v$ of an instruction
with bound name $t$ and records the result in the store by adding
the mapping $[t \mapsto v]$. This uses the semantics of instructions,
and therefore relies on most functions above, such as str-act.

OoO-Ftc. This rule fetches an already-executed PC store
instruction, which potentially adds more instructions to the
program in the state. Intuitively, the function translate$(a, t)$
used in the rule looks up the code at the data area address $a$
and generates the corresponding MIL instructions using names
greater than $t$. Fetches thus enable MIL programs to have
iterative and possibly diverging behavior.

OoO-Cmt. This rule commits an already-executed memory 
store instruction to memory. Both the memory address a and 
the new value v are part of the label’s action, while only the 
former is included in the observation. 

IO-Step. This rule processes instructions using the OoO rules,
but deterministically following the program order.

For instance, in an initial state for $I_{beq}$, OoO-Exe transitions
are enabled for the instructions for $t_{b0}$ and $t_{b1}$. However, if
$t_{b0} < t_{b1}$ as expected, only $t_{b0}$ is enabled for an IO-Step
transition, i.e., the instruction for $t_{b0}$ must be completed before
the instruction for $t_{b1}$.
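
The restriction that IO-Step adds on top of the OoO rules can be sketched as a filter over the OoO-enabled instructions. The Python below is a minimal illustration under simplifying assumptions: names are modeled as integers, and the sets of completed and OoO-enabled names are supplied by the caller rather than computed from a MIL state. The function names are hypothetical.

```python
from typing import Iterable, List, Set


def io_step_allowed(names: Iterable[int], completed: Set[int], t: int) -> bool:
    """Side condition of IO-Step: every instruction named below t is completed."""
    return all(n in completed for n in names if n < t)


def io_enabled(names: List[int], completed: Set[int], ooo_enabled: Set[int]) -> List[int]:
    """Names that are OoO-enabled and also satisfy the IO-Step side condition."""
    return sorted(t for t in ooo_enabled if io_step_allowed(names, completed, t))


# Two instructions are OoO-enabled initially, but only the first may take an IO step.
first = io_enabled([0, 1, 2], set(), {0, 1})
# Once instruction 0 is completed, instruction 1 becomes IO-enabled.
second = io_enabled([0, 1, 2], {0}, {1})
```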

The OoO semantics can be viewed as abstracting the 
behavior of a pipelined single-core microarchitecture which 
receives CISC-like ordered program instructions, and then 
translates each such instruction into one or more RISC-like 
microinstructions which are nondeterministically executed and 
(possibly) completed. For instance, the OoO semantics is rem- 
iniscent of the NetBurst microarchitecture used in Pentium 4 
processors [16]. In contrast, the IO semantics is more akin to 
abstract ISA behavior, where execution must always proceed 
according to an order specified by a programmer or compiler. 
Silver is an example where the microarchitecture itself behaves 
similarly to the MIL IO semantics [17]. 


IV. METATHEORY OF MIL 


While a MIL program has no canonical initial state at 
runtime, we define in this section a notion of state well- 
formedness that, intuitively, ensures program execution does 
not go wrong. However, well-formedness does not by itself 
guarantee progress, e.g., that execution will end up in a state 
where all instructions are completed. For progress, we define 
resource-initialized states, which prevent instruction execution 
from getting stuck. By comparison, the MIL semantics of 
Guanciale et al. [4] did not explicitly account for progress 
and only ruled out some forms of malformed states. 
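
$$\frac{t \leftarrow c?o \in I \quad s(t)\uparrow \quad [c]s \quad [t \leftarrow c?o](I, s, C, F) = (v, obs)}{(I, s, C, F) \xrightarrow{(obs,\ \mathrm{Exe},\ t)} (I, s + [t \mapsto v], C, F)} \quad \text{OoO-Exe}$$

$$\frac{t \leftarrow c?\,\mathrm{st}\ PC\ t_1\ t_2 \in I \quad t \notin F \quad s(t) = a \quad \mathrm{translate}(a, \max(\mathrm{bn}(I))) = I' \quad \mathrm{bn}(\text{str-may}((I, s, C, F), t)) \subseteq F}{(I, s, C, F) \xrightarrow{(il\ a,\ \mathrm{Ftc}(I'),\ t)} (I \cup I', s, C, F \cup \{t\})} \quad \text{OoO-Ftc}$$

$$\frac{t \leftarrow c?\,\mathrm{st}\ M\ t_1\ t_2 \in I \quad t \notin C \quad s(t)\downarrow \quad s(t_1) = a \quad s(t_2) = v \quad \mathrm{bn}(\text{str-may}((I, s, C, F), t)) \subseteq C}{(I, s, C, F) \xrightarrow{(ds\ a,\ \mathrm{Cmt}(a, v),\ t)} (I, s, C \cup \{t\}, F)} \quad \text{OoO-Cmt}$$

$$\frac{\sigma \xrightarrow{(obs,\ \alpha,\ t)} \sigma' \quad \forall \iota \in \sigma.\ \text{if } \mathrm{bn}(\iota) < t \text{ then } \mathcal{C}(\sigma, \iota)}{\sigma \xrightarrow{(obs,\ \alpha,\ t)}_{IO} \sigma'} \quad \text{IO-Step}$$

Fig. 3. OoO and IO labeled transition step relation rules.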


Assuming well-formedness and resource initialization en- 
abled us to prove in HOL4 the memory consistency of the 
OoO and IO semantics of MIL, fully elaborating the previous 
pen-and-paper reasoning and filling in all the gaps. We believe 
this puts our notion of conditional noninterference for MIL 
on firm ground. In turn, this notion allows us to reason about 
information flow in MIL programs, as described in Section VI. 


A. Well-formedness of States 


We now define the requirements for a state $\sigma = (I, s, C, F)$
to be well formed. Properties 1 to 8 below are the basic 
sanity properties that, e.g., express the absence of dangling 
instruction references, that instruction dependencies form a 
directed acyclic graph, and that instruction execution is prop- 
erly recorded in the store and elsewhere. For instance, in states 
including Jpeg, property 2 requires that tp1 < tp2, and property 
3 forbids having tbo = tho. 

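1) $I$ is a finite set such that $C \cup F \subseteq \mathrm{dom}(s) \subseteq \mathrm{bn}(I)$.

2) If $\iota \in I$, $t \in \mathrm{fn}(\iota)$, then $t < \mathrm{bn}(\iota)$ and $\exists \iota' \in I$ such
that $\mathrm{bn}(\iota') = t$.

3) If $\iota \in I$, $\iota' \in I$, and $\mathrm{bn}(\iota) = \mathrm{bn}(\iota')$, then $\iota = \iota'$.

4) If $t \in C$, then bn(str-may$(\sigma, t)$) $\subseteq C$ and $\exists \iota \in I$ such
that $\iota = t \leftarrow c?\,\mathrm{st}\ M\ t_1\ t_2$.

5) If $t \in F$, then bn(str-may$(\sigma, t)$) $\subseteq F$ and $\exists \iota \in I$ such
that $\iota = t \leftarrow c?\,\mathrm{st}\ PC\ t_1\ t_2$.

6) If $\iota \in I$ for $\iota = t \leftarrow c?\,\mathrm{st}\ PC\ t_1\ t_2$ or $t \leftarrow c?\,\mathrm{ld}\ PC\ t_1$,
then $t_1 \leftarrow \mathit{true}?0 \in I$.

7) If $\iota \in I$ for $\iota = t \leftarrow c?\,\mathrm{st}\ \tau\ t_1\ t_2$, and $s(t) = v$, then
$s(t_1)\downarrow$ and $s(t_2) = v$.

8) If $t \leftarrow c?e \in I$, $s(t) = v$, then $[t \leftarrow c?e]\sigma = (v, \epsilon)$.

Properties 9 to 11 below ensure that guards behave as
expected and do not block execution. For instance, property 9
says that if $t_{b5}$ in $I_{beq}$ has a stored value, then $t_{b3}$ has a value
stored which is not equal to false.

9) If $t \leftarrow c?o \in I$ and $s(t)\downarrow$, then $[c]s$.

10) If $t \leftarrow c?o \in I$, $t' \leftarrow c'?o' \in I$ and $t' \in n(c)$, then
$c' = \mathit{true}$.

11) If $t \leftarrow c?o \in I$, $t' \leftarrow c'?o' \in I$, $t' \in n(o)$, $[c]s$,
and $[c']s = v'$, then $v' \neq \mathit{false}$.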

Finally, we impose analogous properties for output from 
the translate function; motivation and details are in the 
supplementary material [10]. In lieu of subject reduction for 
an explicitly typed language, we then proved in HOL4 that 
well-formedness is preserved by all the OoO and IO transition 


rules whenever translate returns output with the required
properties. In particular, the proof relies on the fact that $t < t'$
whenever $\iota \in$ translate$(v, t)$ and $t' \in n(\iota)$. From now on,
we always assume that states are well formed.


B. State Resource Initialization 


Consider a load instruction in a state, e.g., the instruction for
$t_{b2}$ in a state whose program includes $I_{beq}$. Intuitively, during
an Exe transition for $t_{b2}$, the previous value for the register
reg is copied to the store, which is done by finding the last
completed store instruction on reg. However, if there is no
such store instruction, $t_{b2}$ can never be completed.

To address this problem in the MIL metatheory of Guanciale
et al. [4], we introduce a notion of resource initialization
for states. Specifically, we say that the predicate
initialized-resource-set$(\sigma, \tau, V)$ is true precisely when, for all
$v \in V$, there exists a completed store instruction for $v$ and
$\tau$ in $\sigma$ such that there is no earlier load instruction in $\sigma$
for $v$ and $\tau$. We then say that, in resource initialized states,
initialized-resource-set holds for all possible values when
$\tau = R$ or $\tau = M$, and for 0 when $\tau = PC$.

For example, in a well-formed resource initialized state
$\sigma = (I, s, C, F)$ whose program includes $I_{beq}$, we know there
exists an instruction $t \leftarrow c?\,\mathrm{st}\ R\ t'\ t''$ such that $t \in \mathrm{dom}(s)$,
$s(t') = z$, and $t < t_{b4}$, ensuring that we can complete the load
instruction $t_{b4}$ with an Exe transition.


C. Executions, Commits, and Traces 


We define MIL executions formally as (bounded) lists $\pi$ of
state-label-state triples $(\sigma, l, \sigma')$. More specifically, for $\pi$ to
be an OoO execution, it must be non-empty and its triples
must follow the OoO transition relation, which we write as
$\pi = \sigma_1 \xrightarrow{l_1} \sigma_2 \xrightarrow{l_2} \sigma_3 \cdots$. Analogously, when $\pi$ is an IO
execution, we write $\pi = \sigma_1 \xrightarrow{l_1}_{IO} \sigma_2 \xrightarrow{l_2}_{IO} \sigma_3 \cdots$. We also
write $\pi$ ++ $\pi'$ for the concatenation of two executions.

For an execution $\pi$ and a memory address $a$, the function
commits$(\pi, a)$ returns a list with the history of values written
(i.e., sent to the memory subsystem) in $\pi$ to $a$. We define
commits by case analysis on the first transition in $\pi$ so
that, e.g., commits$(\sigma \xrightarrow{(ds\ a,\ \mathrm{Cmt}(a, v),\ t)} \sigma'$ ++ $\pi', a) = v,$ commits$(\pi', a)$.
Finally, the function trace$(\pi)$ returns the
trace of the execution $\pi$, which is its (possibly empty)
list of non-silent observations. As one example, we have
trace$(\sigma \xrightarrow{(dl\ a,\ \mathrm{Exe},\ t)} \sigma'$ ++ $\pi') = dl\ a,$ trace$(\pi')$.
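
As an informal illustration of commits and trace, the Python sketch below represents an execution only by its list of transition labels and omits the states. The `Label` type and the tuple encodings of observations and actions are assumptions made for this example; they do not mirror the HOL4 datatypes.

```python
from typing import List, NamedTuple, Optional, Tuple


class Label(NamedTuple):
    obs: Optional[Tuple[str, int]]  # None is the silent observation, else (kind, address)
    action: Tuple                   # ("Exe",), ("Cmt", a, v), or ("Ftc", instrs)
    name: int                       # bound name of the instruction taking the step


def commits(labels: List[Label], a: int) -> List[int]:
    """History of values committed to address a, in execution order."""
    return [l.action[2] for l in labels
            if l.action[0] == "Cmt" and l.action[1] == a]


def trace(labels: List[Label]) -> List[Tuple[str, int]]:
    """Non-silent observations, in execution order."""
    return [l.obs for l in labels if l.obs is not None]


example = [
    Label(("dl", 8), ("Exe",), 3),          # memory load from address 8
    Label(None, ("Exe",), 4),               # silent internal step
    Label(("ds", 16), ("Cmt", 16, 7), 5),   # commit value 7 to address 16
    Label(("ds", 16), ("Cmt", 16, 9), 6),   # commit value 9 to address 16
]
```

For this example, `commits(example, 16)` yields the two committed values in order, and `trace(example)` drops the silent step.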


D. Functional Correctness: Memory Consistency


Intuitively, two models of program execution are memory 
consistent when they yield the same sequence of memory 
updates for each memory address, which ensures that the final 
result of a program (if there is one) is the same in both 
models. More formally, memory consistency of the OoO and 
IO semantics requires that writes to the same memory location 
are always seen in the same order by an observer, which we 
state and prove as our main theorem. 

Theorem 1: For all well-formed and resource initialized
states $\sigma_1$ and OoO executions $\pi = \sigma_1 \xrightarrow{l_1} \sigma_2 \cdots$, there exists
an IO execution $\pi' = \sigma_1 \xrightarrow{l'_1}_{IO} \sigma'_2 \cdots$ such that for all (address)
values $a$, the list of commits for $a$ in $\pi$ is a prefix of the list of
commits for $a$ in $\pi'$ (and vice versa for IO and OoO execution).

Proof: The proof relies on a two-step reordering lemma
which says that if $\sigma_k \xrightarrow{l} \sigma_{k+1} \xrightarrow{l'} \sigma_{k+2}$ and $n(l') < n(l)$,
then there exist $\sigma'_{k+1}$ and $l''$ such that $\sigma_k \xrightarrow{l''} \sigma'_{k+1} \xrightarrow{l} \sigma_{k+2}$,
where $l'$ and $l''$ have the same commits.

The key steps of the proof are illustrated in Figure 4, and
can be divided into two parts. The first part establishes, by
induction on execution length, that for every OoO execution
there is a corresponding ordered OoO execution, with the
same initial and final state, and the same order of commits
per address, where transition labels respect the total order
on names. In the following, we use $\hat{\pi}$ to identify ordered
OoO executions. Let $\pi = \pi_0$ ++ $\sigma_{k+1} \xrightarrow{l'} \sigma_{k+2}$ be an OoO
execution; then by induction there is an ordered OoO execution
$\hat{\pi}_0 = \hat{\pi}_1$ ++ $\sigma_k \xrightarrow{l} \sigma_{k+1}$ of $\pi_0$. Hence,
$\pi' = \hat{\pi}_1$ ++ $\sigma_k \xrightarrow{l} \sigma_{k+1} \xrightarrow{l'} \sigma_{k+2}$ is also an OoO execution. If $n(l) < n(l')$,
then $\pi'$ is an ordered execution of $\pi$; otherwise, the two-step
reordering lemma guarantees that there exists an OoO
execution $\pi'' = \hat{\pi}_1$ ++ $\sigma_k \xrightarrow{l''} \sigma'_{k+1} \xrightarrow{l} \sigma_{k+2}$. We use
induction again to show that there is an ordered OoO execution
$\hat{\pi}_2$ of $\hat{\pi}_1$ ++ $\sigma_k \xrightarrow{l''} \sigma'_{k+1}$. Clearly, $\pi''' = \hat{\pi}_2$ ++ $\sigma'_{k+1} \xrightarrow{l} \sigma_{k+2}$
is also an OoO execution. Since the label names in $\hat{\pi}_2$ are the
union of the label names in $\hat{\pi}_1$ and $n(l')$, the label names in
$\hat{\pi}_1$ are less than or equal to $n(l)$, and $n(l') < n(l)$, $\pi'''$
is an ordered execution of $\pi$.

In the second part of the proof, we establish that any ordered
OoO partial execution $\pi'''$ can be extended to an ordered OoO
execution $\pi_e$ where all instructions in the last state of $\pi'''$ have
been completed and no other instruction has been completed.
We reuse the above reasoning to guarantee that there is an
ordered OoO execution $\hat{\pi}_e$ of $\pi_e$, with last state $\sigma_e$. Finally,
we show that if an OoO execution is ordered and its last state
has an upper bound $t$ such that all instructions with name
smaller than $t$ have been completed and no other instruction
has been completed, then this execution is an IO execution.
Therefore, for every address $a$, the commits for $a$ in $\pi$ are a
prefix of the commits for $a$ in the IO execution $\hat{\pi}_e$.

The vice versa case is trivial, since any IO execution is also
an OoO execution. $\square$
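
The conclusion of Theorem 1 is a per-address prefix relation between commit histories. The Python sketch below checks that relation on two concrete lists of commit events. It is an illustration only: executions are reduced to assumed lists of (address, value) pairs, and the function names are hypothetical.

```python
from typing import List, Tuple

Commit = Tuple[int, int]  # (address, value), listed in commit order


def commits_of(events: List[Commit], a: int) -> List[int]:
    """Values committed to address a, in order."""
    return [v for (addr, v) in events if addr == a]


def is_prefix(xs: List[int], ys: List[int]) -> bool:
    return ys[:len(xs)] == xs


def commits_are_prefixes(ooo_events: List[Commit], io_events: List[Commit]) -> bool:
    """For every address, the OoO commit history is a prefix of the IO one."""
    addresses = {addr for (addr, _) in ooo_events}
    return all(is_prefix(commits_of(ooo_events, a), commits_of(io_events, a))
               for a in addresses)


# Commits to different addresses may be reordered freely.
ok = commits_are_prefixes([(8, 1), (16, 5)], [(16, 5), (8, 1), (8, 2)])
# Commits to the same address must keep their order.
bad = commits_are_prefixes([(8, 2), (8, 1)], [(8, 1), (8, 2)])
```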

In summary, Guanciale et al. proved a two-step reordering
lemma in their paper [4], which we formalized in HOL4 with
substantial required effort. However, to complete the memory
consistency proof, we also provide novel formal proofs that
(1) the two-step reordering lemma implies the existence of
an ordered OoO execution, (2) any OoO execution can be
extended to complete all currently incomplete instructions,
and (3) an ordered and completed OoO execution is an IO
execution. These three properties were previously only hinted
at and not formally stated or proved.


Fig. 4. Illustration of key steps in the memory consistency proof. $\pi$ is the
given OoO execution, $\sigma_1$ is the initial state in $\pi$, $\pi_e$ is an extension to a
state where all instructions in the last state of $\pi$ (and no other) have been
completed, and $\hat{\pi}_e$ is the IO execution with the commits from $\pi$ as a prefix.

On one hand, memory consistency for the OoO semantics 
expresses that, subject to the conditions given by the seman- 
tics, executing instructions out-of-order is always correct. On 
the other hand, memory consistency provides a useful formal 
verification aid: to show that a real out-of-order processor 
pipeline satisfies memory consistency, it suffices to show that 
its design is simulated by the OoO semantics, without any 
need for dealing explicitly with instruction reordering. In prac- 
tice, this requires demonstrating that processor scheduling is 
equally or more restrictive than MIL’s conditions on resource 
loads and memory commits. 


E. Confidentiality: Conditional Noninterference 


In order to reason about information leaks via cache-based 
side channels transparently without an explicit cache model, 
we assume that the attacker can observe the address of a 
memory load (dl a), the address of a memory store (ds a), as 
well as the value of the program counter (il a). This approach 
makes the attacker more powerful than in many real-world 
scenarios, but is common in analysis of microarchitectural 
vulnerabilities [21] and for verifying constant time implemen- 
tations [22]. In particular, the approach allows us to describe in 
a simple way, devoid of details on caches, when two states are 
indistinguishable by an attacker according to a given labeled 
state transition relation (for MIL, the OoO and IO relations). 

Definition 1: States $\sigma_1$ and $\sigma_2$ are trace-indistinguishable
for a labeled state transition relation $T$, written $\sigma_1 \sim_T \sigma_2$,
if for every $T$-execution $\pi_1$ starting in $\sigma_1$, there exists a
$T$-execution $\pi_2$ starting in $\sigma_2$ such that trace$(\pi_1)$ = trace$(\pi_2)$,
and $\sim_T$ is symmetric.

In the following, we assume a binary relation on states, $\sim_P$,
which we call the security policy. The security policy specifies
the parts of the program state that contain sensitive/high
information and the parts that contain public/low information;
if states are related by $\sim_P$, then this means they have the same
public information and therefore cannot be distinguished by
the attacker prior to execution. We usually assume that the
attacker knows the executing program, which means that $\sim_P$
also constrains the current set of instructions to execute and
future instruction fetches. Moreover, $\sim_P$ usually requires states
to be initial for the program under analysis, i.e., no instruction
of the program has been executed. We take the IO semantics as
a specification of the permitted information flows, and consider
a program secure if an OoO execution of the program does
not leak more information than its IO execution. The following
definition formalizes this intuition:

Definition 2: A system is conditionally noninterferent with
respect to the security policy $\sim_P$, written CNI$(\sim_P)$, if it holds
that $\sim_P \cap \sim_{IO} \subseteq \sim_{OoO}$.

Unfortunately, conditional noninterference does not hold
in general: execution of a program according to the OoO
semantics can introduce new side channels. More specifically,
there is a resource-initialized and well-formed state $\sigma$ and
policy $\sim_P$ such that CNI$(\sim_P)$ is false.

We demonstrate the CNI violation using a state with the
program $I_{ceq}(b_1, b_2)$ from Example 2. IO execution of the
state always produces the trace $[dl\ b_1, ds\ b_2, il\ (pc_0 + 4)]$.
When the flag in the register z is 1, OoO execution produces
one of three traces, due to the possibility of fetching $t_{c42}$
independently of the memory operations and the fact that the
memory store must follow the memory load to respect the data
dependency introduced by $t_{c23}$, i.e., $[dl\ b_1, ds\ b_2, il\ (pc_0 + 4)]$,
$[dl\ b_1, il\ (pc_0 + 4), ds\ b_2]$, and $[il\ (pc_0 + 4), dl\ b_1, ds\ b_2]$. On the
other hand, the trace $[ds\ b_2, dl\ b_1, il\ (pc_0 + 4)]$ is only possible
if the flag z is not 1, since then the memory store can be
reordered ahead of the memory load. By observing such traces,
the attacker learns the flag in z.
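
The counterexample can be replayed by enumerating orderings of the three observations. The Python sketch below is a simplification, not output of the MIL tools: it encodes only the ordering constraint stated in the text (the store must follow the load when z is 1) and assumes that no constraint applies otherwise. Observations are plain strings chosen for this example.

```python
from itertools import permutations

OBSERVATIONS = ("dl b1", "ds b2", "il pc0+4")


def ooo_traces(z: int) -> set:
    """Orderings of the observations allowed under the assumed constraint."""
    traces = set()
    for p in permutations(OBSERVATIONS):
        # When z is 1, the memory store depends on the memory load.
        if z == 1 and p.index("ds b2") < p.index("dl b1"):
            continue
        traces.add(p)
    return traces


# Traces that can only be observed when z is not 1.
distinguishing = ooo_traces(0) - ooo_traces(1)
```

Under these assumptions, `ooo_traces(1)` contains exactly the three traces listed above, and `distinguishing` contains the trace that starts with the store, so an attacker who observes it can conclude that z is not 1.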

For this counterexample, the security policy $\sim_P$ requires
the programs of the two initial states to have the shape
$I_0 \cup I_{ceq}(b_1, b_2)$ and $I'_0 \cup I_{ceq}(b'_1, b'_2)$, where $I_0$ and $I'_0$ set the
initial values of the resources accessed by $I_{ceq}$, and requires
the instructions in $I_{ceq}$ to be undefined in the initial stores. In
this case, CNI$(\sim_P)$ can be proved only if $\sim_P$ also constrains
$I_0$ and $I'_0$ to set the same initial value for z. Intuitively, this
corresponds to considering z to be known by the attacker
before program execution or to declassifying its value.

Due to the possibility of confidentiality violations, we 
develop a semi-automated strategy in Section VI to verify 
conditional noninterference of a given program. 


V. TOOLS FOR ANALYSIS OF MIL PROGRAMS 


A. Computing Executions and Traces Inside HOL4 


Formalizing sets of instructions and names as HOL4 predi- 
cate sets was convenient for abstractly defining MIL and devel- 
oping its metatheory. However, this encoding prevents many 
definitions from being computable, which is a prerequisite 


for translation to CakeML. To obtain computable definitions,
we introduced a refined runtime state $(i, s, c, f)$ that replaces
all sets with polymorphic lists. We then developed list-based
analogues of the semantic definitions in Section II-B, such
as addr and str-act, and proved that they preserve set-based
behavior, assuming that names of instructions in $i$ are unique.

Using our list-based semantic definitions, we developed a
HOL4 function for running MIL refined runtime states and
returning executions, dubbed io-bounded-execution. Besides the
initial state, the function takes an instruction offset argument
and a fuel argument. We found using fuel convenient since
MIL program execution is not guaranteed to terminate. In
io-bounded-execution, we proceed by looking up the instruction
at the indicated offset, completing that instruction, and moving
on to the next instruction in the list until fuel runs out. We
proved the correctness of io-bounded-execution both in terms
of IO and OoO transitions, but outline only the former and
defer details to the supplementary material.

Soundness: If instructions in the initial state $(i, s, c, f)$
are sorted by name and completed up to position $p$, and
io-bounded-execution$((i, s, c, f), p, n) = \pi$, then $\pi$ represents
an IO execution starting in the initial state and ending in a state
where all instructions up to position $p + n$ are completed.

Completeness: If the initial state is well-formed, resource
initialized, and has instructions sorted by name and completed
up to $p$, io-bounded-execution will indeed output an execution.
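
The fuel-based structure described above can be sketched in a few lines of Python. This is not the HOL4 definition of io-bounded-execution: the instruction encoding, the return value, and the helper names are assumptions for illustration, and the sketch leaves out guards, commits, and fetches, which the real function must handle.

```python
from typing import Callable, Dict, List, Optional, Tuple

Store = Dict[str, int]
# An instruction is a name plus an evaluator; the evaluator returns None when stuck.
Instr = Tuple[str, Callable[[Store], Optional[int]]]


def bounded_execution(instrs: List[Instr], store: Store, pos: int, fuel: int):
    """Complete instructions in list order from pos until fuel runs out.

    Returns (steps, final_store), or None if an instruction cannot be completed.
    """
    steps = []
    s = dict(store)
    while fuel > 0 and pos < len(instrs):
        name, evaluate = instrs[pos]
        value = evaluate(s)
        if value is None:
            return None              # stuck: a required value is missing
        s[name] = value              # record the result in the store
        steps.append((name, value))
        pos += 1
        fuel -= 1
    return steps, s


def const(v: int):
    return lambda s: v


def add(a: str, b: str):
    return lambda s: s[a] + s[b] if a in s and b in s else None


program = [("t0", const(1)), ("t1", const(2)), ("t2", add("t0", "t1"))]
```

Running `bounded_execution(program, {}, 0, 2)` completes the first two instructions and stops when fuel is exhausted, while starting at position 2 in an empty store returns None because the operands of t2 are missing. The latter mirrors why the completeness statement requires the instructions before position p to be completed already.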

We used the same approach as for io-bounded-execution to
develop a verified function dubbed io-bounded-trace that only
outputs the corresponding trace of an execution from a given
state, with some basic optimizations to handle large states and
perform many transitions. These functions are useful not only
for running concrete MIL programs; they also allow us to
partially automate proofs [7].


B. Refinement of Computable Functions to CakeML 


While feasible for states of small to moderate size, evaluat- 
ing the functions io-bounded-execution and io-bounded-trace 
inside HOL4 can be slow and does not scale to large and long- 
running MIL programs. We therefore refined our datatypes 
and functions for MIL to be compatible with CakeML’s HOL4 
translation frontend [23]. We then proved the refined functions 
equivalent to our previous list-based definitions. Once the 
CakeML translator accepted all our refined functions, we 
obtained a verified MIL evaluator as a native program. 


C. Translation from BIR to MIL 


To allow generating MIL programs from ISA level code, 
we developed an unverified translation in Standard ML (SML) 
from BIR to MIL, using the SML interfaces of each HOL4 
theory. The main SML translation function takes a BIR pro- 
gram term and a function name g, and as a side effect defines 
a function in HOL4 with that name, mapping BIR block 
addresses (and other necessary parameters) to collections of 
MIL microinstructions. The function g then takes the place of 
translate in our MIL semantics; in particular, we can pass 
g to io-bounded-trace together with a (refined) MIL state. 

Since MIL does not have a canonical expression semantics,
we manually adapted the BIR expression semantics to MIL 
by introducing the corresponding expression abstract syntax 
and using the same HOL4 word theory operations as BIR in 
an executable MIL expression evaluation function. 


VI. VERIFICATION OF CONDITIONAL NONINTERFERENCE 


We develop a general verification strategy for conditional 
noninterference that follows the hypotheses of a lemma: 

Lemma 1: If there exist (1) a relation $L$ such that
$\sim_P \cap \sim_{IO} \subseteq L$ (i.e., $L$ underapproximates program information
leakage during IO execution), (2) a bisimulation $R$ for
OoO semantics (i.e., $R$ overapproximates program information
leakage during OoO execution), and (3) $\sim_P \cap L \subseteq R$ (i.e.,
the initial knowledge of the attacker and the IO information
leakage "are not less than" the OoO leakage), then CNI$(\sim_P)$.
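
One way to see why the three hypotheses suffice is the chain of inclusions below. It is a sketch that assumes the bisimulation $R$ preserves observations, so that $R$-related states are trace-indistinguishable under the OoO semantics; this assumption is made here for the derivation and is not spelled out in the lemma.

```latex
\begin{align*}
\sim_P \cap \sim_{IO}
  &\subseteq \sim_P \cap L
     && \text{by (1), since } \sim_P \cap \sim_{IO} \subseteq \sim_P \\
  &\subseteq R
     && \text{by (3)} \\
  &\subseteq \sim_{OoO}
     && \text{by (2), assuming } R \text{ preserves observations}
\end{align*}
```

The first and last terms give $\sim_P \cap \sim_{IO} \subseteq \sim_{OoO}$, which is CNI$(\sim_P)$ by Definition 2.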

Below, we demonstrate our strategy using the MIL program
$I_{ceq}(b_1, b_2)$ from Example 2. The supplementary material [10]
contains applications of our strategy in HOL4 to verify CNI
for the Example 1 program, the Example 2 program, and a
program that moves values between two registers.


A. Computing the Relation L 


Our strategy uses the IO executor function io-bounded-execution
described in Section V-A to analyze the information
leakage relation symbolically, together with self composition [24].
Since the IO semantics is deterministic, we can
compute the post-relation by limiting the analysis to maximal
executions when programs terminate. In fact, a system is
noninterferent for the IO semantics iff the traces of maximal
executions of any two states in $\sim_P$ are indistinguishable.

For a state with $I_{ceq}$, the IO executor generates the trace
$[dl\ b_1, ds\ b_2, il\ (pc_0 + 4)]$. By using self composition, we
generate the relation $L = (pc_0 = pc'_0 \land b_1 = b'_1 \land b_2 = b'_2)$,
where primed variables are the parameters for the second state.
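
The self-composition step can be pictured as pairing two symbolic traces position by position and collecting the equalities needed to make them coincide. The Python sketch below is a simplified illustration, not the HOL4 procedure: traces are assumed to be lists of (kind, symbolic address) pairs, addresses are uninterpreted strings, and no arithmetic simplification is performed. The example uses the IO trace reported in Section IV-E.

```python
from typing import List, Optional, Tuple

Obs = Tuple[str, str]  # (observation kind, symbolic address)


def leakage_equalities(trace: List[Obs], primed: List[Obs]) -> Optional[List[Tuple[str, str]]]:
    """Equalities under which the two traces are identical.

    Returns None when the traces differ in length or in observation kinds,
    in which case no assignment of addresses can make them equal.
    """
    if len(trace) != len(primed):
        return None
    equalities = []
    for (kind, addr), (kind_p, addr_p) in zip(trace, primed):
        if kind != kind_p:
            return None
        equalities.append((addr, addr_p))
    return equalities


io_trace = [("dl", "b1"), ("ds", "b2"), ("il", "pc0 + 4")]
io_trace_primed = [("dl", "b1'"), ("ds", "b2'"), ("il", "pc0' + 4")]
L = leakage_equalities(io_trace, io_trace_primed)
```

The resulting pairs correspond to the conjuncts of $L$ once the equality on `pc0 + 4` is simplified to an equality on `pc0`.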


B. Identifying and Proving a Bisimulation Relation R


Let $(I_1, s_1, C_1, F_1) = \sigma_1\ R\ \sigma_2 = (I_2, s_2, C_2, F_2)$. To guarantee
that the two states can produce the same observations,
i.e., lists of fetches and commits, we work under the assumption
of control flow preservation, reflecting the no-branch-on-secrets
condition common in cryptographic practice.

This condition leads to a number of constraints on $R$ that
can be used in a proof search procedure:

• Preservation of executed, committed and fetched instructions,
i.e., $\mathrm{dom}(s_1) = \mathrm{dom}(s_2)$, $C_1 = C_2$, and $F_1 = F_2$.

• Preservation of labels (addresses of PC stores and memory
loads/stores), e.g., $s_1(t_{c43}) = s_2(t_{c43})$ for Example 2.

• Preservation of dependencies (including active stores for
loads) and guards. For instance, for $t_{c22}$ and $t_{c23}$ in $I_{ceq}$,
this leads to $s_1(t_{c21}) = s_2(t_{c21})$, since $t_{c21} == 1$ is used
as the guard condition of $t_{c22}$ and $t_{c23}$.

These constraints are then backpropagated to previous
microinstructions, which for the example results in requiring that
the initial value of the flag z (needed for $t_{c22}$) and pc are the
same (needed for $t_{c43}$) in $s_1$ and $s_2$.


The bisimulation proof is greatly simplified by control flow 
preservation. The main challenge is to prove preservation of 
the active stores. This is done by showing that an assignment 
to a name t will either have no effect on the active stores, or 
else the same instruction will be eliminated. 


C. Proving the Entailment of the Bisimulation 


The last verification step, $\sim_P \cap L \subseteq R$, is largely automated.
For the initial states, each bisimulation constraint must
be guaranteed by either $L$ (e.g., for Example 2 the equality of
pc is implied by $pc_0 = pc'_0$ in $L$), when the same information
is leaked by both the OoO and IO semantics, or by $\sim_P$ (e.g.,
for Example 2, the equality of the flag z can be guaranteed
only if we consider the initial value of z to be public, since it
is not leaked by the IO execution), when the OoO execution
introduces additional leakage.


VII. RELATED WORK 


A. Theorem proving for hardware and its interfaces 


Specifications of popular ISAs, e.g., ARMv8-A and 
RISC-V, are available for many theorem provers [14], [25], 
[26]. However, compilers and program analysis tools that only 
consider these specifications are unable to rule out illicit infor- 
mation flows due to microarchitectural vulnerabilities such as 
Spectre, Meltdown, and Foreshadow [1]-[3]. On the hardware 
side, theorem prover formalizations are available for HDLs 
and corresponding circuit synthesizers and compilers [27]- 
[32], but program analysis tools using such specifications have 
to target specific low-level microarchitectures and hardware, 
which may be unrelated to high-level languages or ISAs. 

An alternative is to perform end-to-end specification and 
verification across high-level languages, ISAs, microarchitec- 
tures, and hardware. For instance, Lööw et al. [17] connect 
the compiler for the CakeML language to the Silver ISA and 
single-core processor in HOL4, and Erbsen et al. [33] specify 
and verify in Coq the functional correctness (including instruc- 
tion reordering) of a system based on a pipelined processor 
implementing the RISC-V ISA. However, these efforts focus 
only on functional correctness, and are tied to a particular stack 
of ISA and hardware. This makes proof reuse in other settings 
difficult. In particular, the instruction pipeline reordering proof 
by Erbsen et al. is specific to a processor defined in the Kami 
HDL. We believe that MIL, in contrast, can enable proof reuse 
across end-to-end verification efforts. 


B. Formal models of low-level information flow 


Several works have addressed the formalization of microar- 
chitectural optimizations, such as different forms of specu- 
lation, to capture Spectre-like vulnerabilities [21], [34]-[37]. 
Similarly to MIL, these proposals model an attacker that can 
observe the program counter, memory load addresses, and 
memory store addresses. Their security conditions are defined 
as noninterference or a conditional hyperproperty, similarly to 
conditional noninterference, that compares information flows 
of the same program in a speculative and a sequential
semantics. The semantics by Barthe et al. [35] describes out-of-order
execution, but memory commits have to be done in-order
and consequently memory consistency is straightforward.
Fadideh et al. [37] consider a SAT-based register transfer level
analysis of transient execution for concrete OoO processor
designs, but they do not provide a general model. Other works
only consider speculative in-order instruction processing.

Many of these works have inspired the implementation of 
tools, e.g., Spectector [38], to analyze program side channels 
using some form of (relational) symbolic execution. However, 
to our knowledge, the only mentioned work whose semantics 
and verification approach has reached an interactive theorem 
prover is that of Cheang et al. [36], which was formalized by 
Griffin and Dongol in Isabelle/HOL [39]. Using a translation 
from C-like programs to Isabelle theories similar to our BIR 
translation, Griffin and Dongol reason about information flow 
during speculative execution using Hoare-style triples, but 
they do not account for out-of-order execution. While our 
MIL semantics introduces nondeterminism relationally, i.e., 
by some states simply having several possible transitions 
according to the OoO step relation, the semantics used by 
Griffin and Dongol consults an abstract oracle to resolve 
nondeterminism [21]. 


C. Validation of hardware information flow models 


Buiras et al. [15], [40] and Oleksenko et al. [41] developed 
tools (called Scam-V and Revizor, respectively) to validate 
hardware information flow models. The approaches are based 
on testing leakage models (e.g., the attacker observations 
of Section IV-E) using black box testing on actual CPUs. 
Both Scam-V and Revizor use a variation of conditional 
noninterference, where the goal is to establish that states that 
produce indistinguishable traces in a model produce indistin- 
guishable cache footprints on the real hardware. We believe 
such tools can facilitate trustworthy connections between MIL- 
based information flow analyses and hardware behavior. 


VIII. CONCLUSION 


We presented a formalization in HOL4 of MIL, a language 
which captures key features of microarchitectures to allow 
reasoning about low-level program information flow. The 
formalization includes the in-order and out-of-order dynamic 
semantics of MIL, a proof of memory consistency between the 
two semantics, and a notion of conditional noninterference 
that rules out trace-driven cache based side channels. The 
formalization is around 34,000 lines of code with examples, 
and took around 24 person months to develop. The code [10] 
was tested on HOL4 kananaskis-14 and PolyML 5.9. 

We envision that our MIL formalization and tools will 
be integrated into a trustworthy program information flow 
analysis workflow based on CakeML, where binaries for ISAs 
supported by HolBA are first represented in BIR and then 
translated to MIL to establish conditional noninterference 
or to demonstrate side channels. Our unverified BIR-to-MIL 
translation and verified example programs indicate that the 


workflow is feasible, but the manual effort of conditional non- 
interference proofs is currently the main obstacle. In particular, 
bisimulation based reasoning can easily lead to unproductive 
exploration of the many possible transitions available due 
to nondeterminism. However, even without full automation 
of conditional noninterference proofs, we believe MIL and 
its metatheory and tools can improve productivity in formal 
verification of confidentiality properties of practical systems. 


Abstract: Creating a compiler for an instruction set architecture
(ISA) requires a set of rewrite rules describing how to
translate from the compiler’s intermediate representation (IR) to 
the ISA. We address this challenge by synthesizing rewrite rules 
from a register-transfer level (RTL) description of the target 
architecture (with minimal annotations about its state and the 
ISA format), together with formal IR semantics, by constructing 
SMT queries where solutions represent valid rewrite rules. 

We evaluate our approach on multiple architectures, support- 
ing both integer and floating-point operations. We synthesize both 
integer and floating-point rewrite rules from an intermediate 
representation to various reconfigurable array architectures in 
under 1.2 seconds per rule. We also synthesize integer rewrite 
rules from WebAssembly to RISC-V with both standard and 
custom extensions in under 4 seconds per rule, and we synthesize 
floating-point rewrite rules in under 8 seconds per rule. 


I. INTRODUCTION 


The end of Moore’s law and Dennard scaling means that 
processor performance will not continue to increase expo- 
nentially due to improvements in process technology. Future 
performance increases will instead be due to the increased 
efficiency of domain-specific architectures and accelerators. In 
their Turing Award lecture, John Hennessy and David Patter- 
son envision such a future; they predict that these innovations 
will lead to a new golden age of computer architecture [33]. 
In order to realize this vision, there must be a corresponding 
golden age of software tools, programming models, and com- 
pilers to design and program specialized architectures [55]. 

Every new instruction set architecture (ISA) must be ac- 
companied by a set of rewrite rules to be used in code 
generation. These rules describe how to transform a compiler’s 
intermediate representation (IR) to the ISA. Crafting these 
rules is a labor-intensive task and is often performed by 
someone other than the ISA designer. Hence, the ISA must be
carefully documented to support compiler writers—this too is 
a tedious, error-prone process. Moreover, changes to an ISA 
require new documentation and new rewrite rules. 

This leads to a world where there are very few ISAs, 
and design space exploration is limited to microarchitectural 
details. To perform architectural design space exploration, a 
working compiler is critical to perform realistic benchmarking. 
The work in this paper arose in the context of the Agile 
Hardware Project, where one of the primary goals is to 
facilitate rapid design space exploration for a coarse-grained 
reconfigurable array (CGRA) [7]. We found that manually 
maintaining rewrite rules for a rapidly changing architecture 
was a constant pain point. This experience led us to develop 
a method for automatically synthesizing instruction selection 
rewrite rules, which is the primary contribution of this paper. 
Our method requires a register-transfer level (RTL) description
of the target architecture, a description of the architectural
state, and a description of the instruction format. This method 
has made possible the efficient and algorithmic exploration of 
large design spaces [41], as generation of the rewrite rules can 
be efficiently performed without a human in the loop. 

Even for established ISAs, it is easy to overlook nuances 
that are obvious to the ISA designers. This can lead to ineffi- 
ciencies in compiled code. For example, the RISC-V ISA does 
not include equals or not-equals instructions but documents 
“pseudo operations” for performing them using a subtract and
an unsigned less-than ((x - y) < 1 and 0 < (x - y) 
respectively) [2]. Similarly, there are no instructions for less- 
than-or-equals or greater-than-or-equals, each of which can 
also be implemented as two-instruction sequences using a
less-than and an xor ((y < x) ^ 1 and (x < y) ^ 1,
respectively); however, these sequences are not documented. 

Using the architecture’s RTL, we synthesize rewrite rules by 
constructing first-order logic queries whose solutions, obtained 
using a satisfiability modulo theories (SMT) solver, represent
instruction selection rewrite rules. Additionally, we propose 
a methodology for abstracting complex operations, such as 
floating-point operations, which proved too costly for previous
SMT-based approaches [12]. While some prior work [18], 
[19], [49], [12] tackled similar problems, they used manually- 
defined ISA specifications in the form of enumerated lists of 
instructions with their parameters and semantics. Using RTL 
directly has the benefit of avoiding this manual specification 
step. This is particularly important when doing design space 
exploration, as it is difficult to maintain both the RTL and 
a corresponding formal specification for a rapidly changing 
design. ISA specifications also do not typically capture the 
instruction format or the instruction decode logic, both of 
which are needed for an end-to-end correctness argument. In 
addition to these benefits, using the RTL directly also presents 
unique challenges which we address. Our main contributions 
are as follows: 


• Formalization of the correctness criteria for a general
  class of rewrite rules between arbitrary IRs and RTL-based
  architectures.

• A technique for supporting parametric rewrite rules.

• A method for abstracting operations whose semantics are
  either unknown or too complex to model efficiently (e.g.,
  floating-point operations).

• A methodology for efficiently encoding and solving the
  rewrite rule synthesis problem using SMT.


In our evaluation, we synthesize rewrite rules from CoreIR 
(an IR designed for RTL) [16] to a family of CGRAs. We also 
synthesize rewrite rules from WebAssembly to various RISC- 
V architectures. We target both the base RISC-V ISA and 
a number of extensions, including extensions with floating- 
point operations. All of these tasks can be done in seconds. 
Additionally, we are able to synthesize short multi-instruction 
sequences for pseudo-operations such as those mentioned 
above (whether officially documented or not). These take at 
most 90 seconds to synthesize. 

The rest of this paper is organized as follows. Section II pro- 
vides background on compilers, instruction selection, rewrite 
rules, and SMT. Section III formalizes rewrite rules and
describes our encoding of the problem into SMT. Section IV 
presents case studies highlighting the utility and performance 
of the tool. Section V covers related work and, finally, 
Section VI provides future steps to take towards a general, 
automatically-derived compiler. 


II. BACKGROUND 


A. Code Generation 


Most compilers share a common structure: a front end
which translates a high-level language into an IR, an optimizer
which optimizes the IR, and a code-generator which translates
the IR into a hardware-specific representation (which then
may be further optimized for the target architecture). The
code generation stage typically involves instruction selection,
scheduling, resource assignment, and assembly.

Notation                                      Meaning
$BV_{[n]}$                                    Sort for bitvectors of length n
$+_{[n]}, -_{[n]}, \times_{[n]}, \div_{[n]}$  Arithmetic modulo $2^n$
$+_{f[n]}$                                    n-bit floating-point addition
$x \circ y$                                   Bitvector concatenation
$x[msb : lsb]$                                Bitvector extraction
$ite(c, x, y)$                                If-then-else: if c then x else y
$a[i]$                                        Read from array a at index i
$a[i] := v$                                   Result of updating array a at index i with value v
$T ::= C_1(s_1 : \sigma_1) \mid$              Algebraic data type T with constructors $C_1$, $C_2$,
$\quad C_2(s_2 : \sigma_2, s_3 : \sigma_3)$   testers $is\_C_1$ and $is\_C_2$, and selectors $s_i$ of sort $\sigma_i$

TABLE I: Theory-specific notation.

There has been significant work devoted to developing 
instruction selection algorithms [29], [25], [26], [45], [20], 
[4], [24], [23], [10] that use a set of pre-defined rewrite 
rules to translate IR programs to architectural instructions. 
These rewrite rules are dependent on the target ISA and are 
usually constructed manually. In this paper, we automatically 
synthesize rewrite rules from the RTL of target architectures.


B. Logical Setting 


We work in the setting of many-sorted logic (see e.g., [21],
[54]). Let $S$ be a set of sort symbols (sorts in this setting
play a role similar to types in type theory). For every sort
$\sigma \in S$, we assume an infinite set of variables of that sort. We
assume the usual definitions of terms, literals, formulas, and
interpretations, and use $\models$ to denote the satisfiability relation
between interpretations and formulas. We write $e\{x \mapsto t\}$ for
the result of simultaneously replacing each occurrence of $x$ in
$e$ by $t$. If $\mathbf{x}_1$ and $\mathbf{x}_2$ are two vectors of variables, we write
$\mathbf{x}_1 :: \mathbf{x}_2$ to denote their concatenation. A term of the form
$ite(\varphi, t_1, t_2)$ is an if-then-else operator, whose meaning is
the same as $t_1$ in an interpretation $I$ where $I \models \varphi$, and the
same as $t_2$ otherwise.

A theory $\mathcal{T}$ assigns meaning to certain theory-specific
symbols by fixing a class of allowable interpretations (e.g.,
it may fix the meaning of the symbol ‘+’ to be the addition
function). A formula $\varphi$ is $\mathcal{T}$-satisfiable (resp., $\mathcal{T}$-unsatisfiable,
$\mathcal{T}$-valid) if it is satisfied by some (resp., no, all) interpretations
in $\mathcal{T}$. The satisfiability modulo theories (SMT) problem is
the question of determining the $\mathcal{T}$-satisfiability of a formula
for some given theory $\mathcal{T}$. SMT solvers solve this problem for
a standard set of useful theories (and their combinations).

Some examples of common theories supported by SMT 
solvers include fixed-width bit-vectors, arrays, integer and 
floating-point arithmetic, uninterpreted functions, and alge- 
braic data types. Table I lists some notation from these theories 
that we will use in illustrative examples below. A more 
thorough introduction to SMT can be found in [9]. 
III. SYNTHESIZING REWRITE RULES


Rewrite rules are a key component in instruction selection, 
as they indicate the options for how to transform one or 
more IR instructions into one or more architecture-specific 
instructions. In this section, we show how to formalize and 
solve the rewrite rule synthesis problem using SMT. 


A. Intermediate Representation Formalization 


An intermediate representation (IR) includes a collection 
of instructions which can be composed together in various 
ways to represent programs. IR instructions can be represented 
in many ways, including as graphs or as functions. Here, 
we represent IR instructions as SMT formulas. The formulas 
encode how an instruction’s inputs are transformed into a set 
of outputs. Formally, let $\mathbf{x} = (x_1 : \sigma_1, \ldots, x_k : \sigma_k)$ be a vector
of variables. Then, the tuple $IR(\mathbf{x}) = (IR_1(\mathbf{x}), \ldots, IR_l(\mathbf{x}))$
is an IR instruction with $k$ inputs (each $x_i$ is an input) and $l$
outputs (represented by each $IR_j$). The value of output $j$ for a
given concrete input $(c_1, \ldots, c_k)$ is given by constructing the
formula $IR_j(c_1, \ldots, c_k)$ and then evaluating it using the
semantics of the theory operations in the formula. For example,
an 8-bit adder with two outputs, the sum and the carry-out,
and inputs $x_1$ and $x_2$ of sort $BV_{[8]}$ could be represented as:

$$(x_1 +_{[8]} x_2,\ ((0 \circ x_1) +_{[9]} (0 \circ x_2))[8 : 8]).$$

For the concrete input (11111111, 00000001), the outputs are
00000000 and 1, respectively.
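
The adder example can be mirrored outside of an SMT solver. The following is a minimal sketch, assuming Python integers stand in for bitvectors; the names `ir_sum`, `ir_carry`, `ADDER_IR`, and `evaluate` are illustrative and do not come from the paper's tooling.

```python
MASK8 = 0xFF  # 8-bit values are represented as Python ints in [0, 255]

def ir_sum(x1: int, x2: int) -> int:
    """Sum output: x1 +[8] x2, i.e. addition modulo 2^8."""
    return (x1 + x2) & MASK8

def ir_carry(x1: int, x2: int) -> int:
    """Carry-out: bit 8 of the 9-bit sum of the zero-extended inputs."""
    return ((x1 + x2) >> 8) & 1

# IR(x) = (IR_1(x), IR_2(x)): one function per output of the instruction.
ADDER_IR = (ir_sum, ir_carry)

def evaluate(ir, *inputs):
    """Evaluate every output formula on one concrete input vector."""
    return tuple(out(*inputs) for out in ir)

print(evaluate(ADDER_IR, 0b11111111, 0b00000001))  # (0, 1)
```

The printed pair matches the outputs 00000000 and 1 stated above.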

A formula-tuple $IR$ need not represent only a single
instruction. A complex operation or pseudo-instruction can be
represented as a composition of other instructions. Our SMT
representation can easily accommodate composition: if an
output $IR_{1,j}$ from $IR_1$ is connected to an input $x_i$ in $IR_2$,
then the composition is simply the result of substituting $IR_{1,j}$
for $x_i$ in $IR_2$, i.e., $IR_2\{x_i \mapsto IR_{1,j}\}$. Below, we assume that
$IR$ represents some IR program (comprising one or more IR
instructions) that we wish to find a rewrite rule for.


B. Architecture Formalization 


An architecture is a circuit that is parameterized by a single 
architectural instruction value (separate from and not to be 
confused with the IR instructions mentioned above), which 
indicates how other inputs and existing states are transformed 
into outputs and next states. As above, we represent an 
architecture as a tuple of SMT formulas. The instruction 
itself is an input to the architecture, which we assume can 
be modeled as a variable $inst$ of sort $\tau$. We further let
$\mathbf{y} = (y_1 : \tau_1, \ldots, y_m : \tau_m)$ be a vector of variables with sorts
in $S$, where $\tau_i$ is the sort of the architecture's $i$-th input. The
tuple $Arch(inst, \mathbf{y}) = (Arch_1(inst, \mathbf{y}), \ldots, Arch_n(inst, \mathbf{y}))$
is an architecture with $m + 1$ inputs and $n$ outputs. As an
example, consider an 8-bit ALU with 4 operations. An input
$inst$ of sort $BV_{[2]}$ selects which operation to perform on two
other inputs, $y_1$ and $y_2$, both of sort $BV_{[8]}$. Its single output
is also of sort $BV_{[8]}$. For this example, $Arch$ could be:

$$(ite(inst = 00,\ y_1 -_{[8]} y_2,\ ite(inst = 01,\ y_1 +_{[8]} y_2,\ ite(inst = 10,\ y_1 *_{[8]} y_2,\ y_1 \div_{[8]} y_2)))).$$


States. Architectures with states can be modeled by including
current state values as inputs and next state values as outputs.
Suppose $\mathbf{z} = (z_1 : \omega_1, \ldots, z_p : \omega_p)$ are variables representing
the states. Then, we can represent the architecture as:

$$Arch(inst, \mathbf{y}, \mathbf{z}) = (Arch_1(inst, \mathbf{y}, \mathbf{z}), \ldots, Arch_n(inst, \mathbf{y}, \mathbf{z}),\ Arch_{n+1}(inst, \mathbf{y}, \mathbf{z}), \ldots, Arch_{n+p}(inst, \mathbf{y}, \mathbf{z})),$$

where $Arch_{n+i}$ are formulas that encode the next-state function
for the $i$-th state variable. An example with states appears
in Section III-C, below.


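Composing Architectures. A rewrite rule for an IR program
might require more than one instruction at the architectural
level. Fortunately, as was the case for IR programs,
it is straightforward to compose multiple architectures using
our SMT representation. Let $Arch_1(inst_1, \mathbf{y}_1, \mathbf{z}_1)$ and
$Arch_2(inst_2, \mathbf{y}_2, \mathbf{z}_2)$ be two architectures with $m_1$ and $m_2$
inputs, $p_1$ and $p_2$ states, and $n_1$ and $n_2$ outputs, respectively,
and suppose that output $i$ of $Arch_1$ is passed into
input $j$ of $Arch_2$. Let $inst = (inst_1, inst_2)$,
$\mathbf{y} = \mathbf{y}_1 :: (y_{2,1}, \ldots, y_{2,j-1}, y_{2,j+1}, \ldots, y_{2,m_2})$, $\mathbf{z} = \mathbf{z}_1 :: \mathbf{z}_2$, and
$\mathbf{y}_2' = \mathbf{y}_2\{y_{2,j} \mapsto Arch_{1,i}(inst_1, \mathbf{y}_1, \mathbf{z}_1)\}$. Then, the composition
is:

$$\begin{aligned}
Arch(inst, \mathbf{y}, \mathbf{z}) = (&Arch_{1,1}(inst_1, \mathbf{y}_1, \mathbf{z}_1), \ldots, Arch_{1,n_1}(inst_1, \mathbf{y}_1, \mathbf{z}_1), \\
&Arch_{2,1}(inst_2, \mathbf{y}_2', \mathbf{z}_2), \ldots, Arch_{2,n_2}(inst_2, \mathbf{y}_2', \mathbf{z}_2), \\
&Arch_{1,n_1+1}(inst_1, \mathbf{y}_1, \mathbf{z}_1), \ldots, Arch_{1,n_1+p_1}(inst_1, \mathbf{y}_1, \mathbf{z}_1), \\
&Arch_{2,n_2+1}(inst_2, \mathbf{y}_2', \mathbf{z}_2), \ldots, Arch_{2,n_2+p_2}(inst_2, \mathbf{y}_2', \mathbf{z}_2)).
\end{aligned}$$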


C. Rewrite Rule Formalization 


A rewrite rule defines how a specific IR program can be 
implemented using one or more instructions of a particular 
architecture. We start with a simple but incomplete definition 
of a rewrite rule and incrementally build up a definition with 
more generality and sophistication. The simplest rewrite rule 
is a tuple $(IR, Arch, inst_c)$, where $IR$ is an IR program,
$Arch$ is an architecture (without states for now), and $inst_c$ is
a concrete constant (i.e., a constant that maps to a particular
domain value, like 0 or 1) of sort $\tau$. We say such a tuple is a
valid rewrite rule if the following formula is well-formed and
$\mathcal{T}$-valid:

$$\forall \mathbf{x}.\ Arch(inst_c, \mathbf{x}) = IR(\mathbf{x}) \tag{1}$$

Note that well-formedness requires that $Arch$ and $IR$ have
the same number of inputs and outputs and that corresponding
inputs and outputs have the same sort. As an example, take
again the sum output of the IR program given in Section III-A,
that is, $IR = (x_1 +_{[8]} x_2)$, and suppose $Arch$ is as given
in Section III-B. Then, (1) holds when $inst_c = 01$, and so
$(IR, Arch, 01)$ is a valid rewrite rule. In practice, things can
be more complicated in several ways, which we address next.
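
For an architecture this small, formula (1) can be checked by exhaustive enumeration instead of an SMT query. The sketch below assumes the four-operation ALU from Section III-B; the fourth operation is taken to be unsigned division (returning 0 on a zero divisor), which is an assumption made only for this illustration, and `is_valid_rule` is a hypothetical helper name.

```python
from itertools import product

MASK8 = 0xFF

def arch(inst: int, y1: int, y2: int) -> int:
    """8-bit ALU selected by a 2-bit instruction value."""
    if inst == 0b00:
        return (y1 - y2) & MASK8
    if inst == 0b01:
        return (y1 + y2) & MASK8
    if inst == 0b10:
        return (y1 * y2) & MASK8
    # Assumption: the fourth operation is unsigned division (0 if y2 == 0).
    return y1 // y2 if y2 else 0

def ir(x1: int, x2: int) -> int:
    """IR program: the sum output of the 8-bit adder."""
    return (x1 + x2) & MASK8

def is_valid_rule(inst_c: int) -> bool:
    """Formula (1): for all x, Arch(inst_c, x) = IR(x)."""
    return all(arch(inst_c, x1, x2) == ir(x1, x2)
               for x1, x2 in product(range(256), repeat=2))

valid = [inst for inst in range(4) if is_valid_rule(inst)]
print(valid)  # [1]
```

Only the instruction value 01 survives, which agrees with the rule $(IR, Arch, 01)$ above. Enumeration is used here purely for illustration; it does not scale to realistic bit widths, which is why the paper relies on an SMT solver.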


Bindings. One problem with (1) is that the inputs and outputs
of the IR rarely match those of the architecture. A more
general rewrite rule is $(IR, Arch, inst_c, b^{in}, b^{out})$, where
$(b^{in}, b^{out})$ is a pair of formula tuples, called a binding, that
specifies how to map between the inputs and outputs of the
two formulas. The rewrite rule is valid if the following formula
is well-formed and valid:

$$\forall \mathbf{x}.\ b^{out}(Arch(inst_c, b^{in}(\mathbf{x}))) = IR(\mathbf{x}). \tag{2}$$

Here, well-formedness means $b^{in}(\mathbf{x}) = (b^{in}_1(\mathbf{x}), \ldots, b^{in}_m(\mathbf{x}))$,
where each $b^{in}_i(\mathbf{x})$ has sort $\tau_i$. We also require $b^{out}(\mathbf{w}) =
(b^{out}_1(\mathbf{w}), \ldots, b^{out}_l(\mathbf{w}))$, where $\mathbf{w} = (w_1, \ldots, w_n)$, the sort
of each $w_i$ matches $Arch_i$, and the sort of each $b^{out}_j$ matches
$IR_j$. As an example, consider $b^{in} = (x_2, x_1)$ and $b^{out} = (w_2)$.
This binding swaps the two IR inputs and only uses the second
architecture output.

Another complexity with bindings is that sometimes it is 
necessary to map the IR inputs to only a subset of the 
architecture inputs (for example, mapping a unary IR operation 
to an ISA supporting only binary operations). The extra inputs 
which do not correspond to any IR input must not have 
any effect on the output. To model this, we extend $b^{in}$ so
that, in addition to $\mathbf{x}$, it also takes additional arguments
$\mathbf{y} = (y_1, \ldots, y_m)$ with sorts $(\tau_1, \ldots, \tau_m)$. The idea is that the
binding can choose to simply map some variable $y_i$ to an extra
architecture input. With this extension, we can write the new
rewrite rule formula as follows:

$$\forall \mathbf{x}, \mathbf{y}.\ b^{out}(Arch(inst_c, b^{in}(\mathbf{x}, \mathbf{y}))) = IR(\mathbf{x}). \tag{3}$$


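Finally, we can handle the full generality of architectures
with states by including these in the binding as
well, where $b^{in}$ is extended to be a function of sort
$(\sigma_1 \ldots \sigma_k, \tau_1 \ldots \tau_m, \omega_1 \ldots \omega_p) \to (\tau_1 \ldots \tau_m, \omega_1 \ldots \omega_p)$, and
$b^{out}$ also takes an additional $p$ inputs of sort $\omega_1, \ldots, \omega_p$:

$$\forall \mathbf{x}, \mathbf{y}, \mathbf{z}.\ b^{out}(Arch(inst_c, b^{in}(\mathbf{x}, \mathbf{y}, \mathbf{z}))) = IR(\mathbf{x}). \tag{4}$$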


As an example, consider a simple architecture which either
multiplies its inputs and accumulates the result into a
register file $z$ (represented by an array variable) at index 0
while outputting the product, or performs a subtraction, both
outputting the result and storing it at index 1 of the register
file. Assume the instruction is of sort $BV_{[1]}$, and the other
inputs are of sort $BV_{[8]}$. All operators use 8-bit arithmetic (so
we will omit the [8] subscript to ease readability). The formula
for the architecture is then:

$$Arch(inst, y_1, y_2, z) = (ite(inst = 0,\ y_1 * y_2,\ y_1 - y_2),\ ite(inst = 0,\ z[0] := z[0] + (y_1 * y_2),\ z[1] := y_1 - y_2))$$

Note that the first formula in the $Arch$ tuple represents the
output of the architecture, while the second represents the next
state of $z$. Now, suppose we are searching for a rewrite rule
for $IR(\mathbf{x}) = (x_3 * x_2) + x_1$. One valid rule is $inst_c = 0$,
$b^{in}(\mathbf{x}, \mathbf{y}, z) = (x_3, x_2, z[0] := x_1)$, and $b^{out}(\mathbf{w}) = w_2[0]$
(note that $w_2$ represents the second input to $b^{out}$, which
corresponds to the register file state). This rule represents a
solution using $inst_c = 0$ when $x_1$ is the value of $z[0]$, $x_2$
drives the $y_2$ input, and $x_3$ drives the $y_1$ input. The result is
stored at index 0 of the (next state value of the) register file.
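
The multiply-accumulate rule above can be sanity-checked by enumeration. This is a sketch under stated assumptions: the bit width is scaled down from 8 to 3 bits so that the exhaustive loop stays small, the register file is modeled as a two-entry tuple, and the function names are illustrative only.

```python
from itertools import product

WIDTH = 3                    # assumption: scaled down from 8 bits for speed
MASK = (1 << WIDTH) - 1

def arch(inst, y1, y2, z):
    """Return (output, next register file) of the example architecture."""
    nxt = list(z)
    if inst == 0:
        out = (y1 * y2) & MASK
        nxt[0] = (z[0] + out) & MASK     # z[0] := z[0] + (y1 * y2)
    else:
        out = (y1 - y2) & MASK
        nxt[1] = out                     # z[1] := y1 - y2
    return out, tuple(nxt)

def ir(x1, x2, x3):
    """IR program: (x3 * x2) + x1."""
    return ((x3 * x2) + x1) & MASK

def b_in(x, z):
    """Input binding: y1 <- x3, y2 <- x2, z[0] := x1."""
    x1, x2, x3 = x
    return x3, x2, (x1,) + tuple(z[1:])

def b_out(w):
    """Output binding: read index 0 of the next-state register file (w2)."""
    _w1, w2 = w
    return w2[0]

def rule_holds(inst_c):
    """Formula (4), checked over every x and every initial register file z."""
    values = range(MASK + 1)
    for x in product(values, repeat=3):
        for z in product(values, repeat=2):
            y1, y2, z_bound = b_in(x, z)
            if b_out(arch(inst_c, y1, y2, z_bound)) != ir(*x):
                return False
    return True

print(rule_holds(0), rule_holds(1))  # True False
```

The rule holds for $inst_c = 0$ regardless of the initial register file contents, because the binding overwrites index 0 with $x_1$ before the instruction executes.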


D. Rewrite Rule Synthesis 


We next formalize the problem of synthesizing rewrite rules. 
We assume that we are given IR and Arch representing an 
IR program and an architecture, respectively. We must find 
$inst_c$, $b^{in}$, and $b^{out}$. Starting from (4), we can simply replace
$inst_c$, $b^{in}$, and $b^{out}$ with variables to get a (second-order)
formula. It is also useful to make the bindings a function of
the instruction, as we explain below. Thus, we have:

$$\exists\, inst, b^{in}, b^{out}.\ \forall \mathbf{x}, \mathbf{y}, \mathbf{z}.\ b^{out}(inst, Arch(inst, b^{in}(inst, \mathbf{x}, \mathbf{y}, \mathbf{z}))) = IR(\mathbf{x}). \tag{5}$$

If (5) holds, then there exists a valid rewrite rule.

In order to use (5) for a practical rewrite rule synthesis al- 
gorithm, we must additionally specify what kinds of functions 
are allowed for $b^{in}$ and $b^{out}$. These functions should tell us
how to map the inputs and outputs, but should not introduce 
extra functionality. For non-state inputs to the architecture, we 
simply require that the binding either pick a variable in x or 
pass through the corresponding variable from y. 

For state inputs, there are two² cases. For programmable
states (states with compile-time addresses that can be written 
and read by instructions, e.g., a register file), we allow the 
binding to update part of the state with a variable in x. 
This corresponds to a previous instruction storing its result 
(the input for the current instruction) in the state. We do 
this by using array variables for these states and allowing 
the binding to write to the arrays. Other states, such as the 
accumulators or other non-programmable registers, are passed 
through unchanged by the binding. Formally, we require: 


$$b^{in}_i(inst, \mathbf{x}, \mathbf{y}, \mathbf{z}) = \begin{cases} y_i \text{ or } x_j\ (1 \le j \le k), & \text{if } i \le m, \\ z_{i-m}, & \text{if } i > m \text{ (non-programmable)}, \\ update(z_{i-m}, inst, \mathbf{x}), & \text{otherwise,} \end{cases}$$

where $update(z, inst, \mathbf{x})$ is one or more array writes to $z$ at
indexes specified by one or more fields in $inst$ and with values
from the variables in $\mathbf{x}$. The output binding is similar:

$$b^{out}_i(inst, \mathbf{w}) = \begin{cases} w_j\ (1 \le j \le n + p), \text{ or} \\ read(w_j, inst)\ (n + 1 \le j \le n + p), \text{ where } w_j \text{ is programmable,} \end{cases}$$

where $read(w, inst)$ is a read from the array $w$ at an index
specified by some field of $inst$. Implicit in this formulation
is the requirement that instructions must either directly output
their result or write it to programmable state in a single
cycle. Pipeline registers and other micro-architectural state fall
into the category of states which cannot be bound. We discuss
possible approaches for handling pipelining in Section VI.
We next explain how to solve (5), subject to the constraints
on bindings. But first, we introduce two useful generalizations.

² A third kind of state with computed addresses (like indirect loads and
stores) can be handled in a way similar to [12], or by using the computed
address from the architecture and the IR in the output bindings.
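
For a stateless architecture with a single output, the restricted form of the bindings makes formula (5) small enough to solve by enumerating every instruction value and every legal input binding. The sketch below is illustrative only: it assumes a 3-bit version of the ALU from Section III-B (with unsigned division assumed as the fourth operation), uses the identity output binding, and all names are hypothetical.

```python
from itertools import product

WIDTH = 3                     # assumption: small width keeps enumeration cheap
MASK = (1 << WIDTH) - 1
VALUES = range(MASK + 1)

def arch(inst, y1, y2):
    if inst == 0b00:
        return (y1 - y2) & MASK
    if inst == 0b01:
        return (y1 + y2) & MASK
    if inst == 0b10:
        return (y1 * y2) & MASK
    return y1 // y2 if y2 else 0   # assumed fourth operation

def ir(x1, x2):
    """IR program to implement: x2 - x1."""
    return (x2 - x1) & MASK

def input_choices(i, k):
    """Architecture input i is driven by some x_j or passes through y_i."""
    return [("x", j) for j in range(k)] + [("y", i)]

def apply_binding(binding, x, y):
    return tuple(x[idx] if kind == "x" else y[idx] for kind, idx in binding)

def synthesize():
    """Yield every (inst, b_in) for which formula (5) holds for all x and y."""
    bindings = list(product(input_choices(0, 2), input_choices(1, 2)))
    for inst, binding in product(range(4), bindings):
        if all(arch(inst, *apply_binding(binding, x, y)) == ir(*x)
               for x in product(VALUES, repeat=2)
               for y in product(VALUES, repeat=2)):
            yield inst, binding

print(list(synthesize()))  # [(0, (('x', 1), ('x', 0)))]
```

The only solution selects subtraction and swaps the IR inputs, i.e., $y_1$ is driven by $x_2$ and $y_2$ by $x_1$. Bindings that pass a free variable $y_i$ through are rejected because the output would then depend on an input that does not correspond to any IR input.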


Synthesizing Parametric Rewrite Rules. Sometimes, we are 
interested in finding a parameterized rewrite rule that works 
for a family of IR nodes (for instance, the family of IR 
instructions that multiply a constant parameter by some input). 
Rather than having to discover a different rewrite rule for each 
value of the parameter, we would like to solve the problem 
once and have it work for all possible values of the parameter. 
Formally, let c be a vector of parameters, and let IR(c, x) be 
a family of IR nodes parameterized by c. Using equation (5) 
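as a starting point, the new rewrite formula becomes:

$$\forall \mathbf{c}.\ \exists\, inst, b^{in}, b^{out}.\ \forall \mathbf{x}, \mathbf{y}, \mathbf{z}.\ b^{out}(inst, Arch(inst, b^{in}(inst, \mathbf{x}, \mathbf{y}, \mathbf{z}))) = IR(\mathbf{c}, \mathbf{x}). \tag{6}$$

In other words, we would like there to be an appropriate
instruction encoding for each value of the parameter $\mathbf{c}$. As
it stands, this formulation is not very useful, as it does not tell
us how to connect the instruction to the parameter. However,
by Skolemizing (6), we get the following:

$$\exists\, inst, b^{in}, b^{out}.\ \forall \mathbf{c}, \mathbf{x}, \mathbf{y}, \mathbf{z}.\ b^{out}(inst(\mathbf{c}), Arch(inst(\mathbf{c}), b^{in}(inst(\mathbf{c}), \mathbf{x}, \mathbf{y}, \mathbf{z}))) = IR(\mathbf{c}, \mathbf{x}). \tag{7}$$

where now, $inst$ is a function from $\mathbf{c}$ to instructions.³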


Abstracting Complex Operations. Complex operations (e.g., 
floating-point arithmetic) can present a challenge. However, 
it is often the case that there are identical complex oper- 
ations in the IR and in the architecture. We can handle 
such situations by replacing such complex operations with 
uninterpreted functions [13]. We must be careful about how 
this is done though. If we simply introduce new function 
symbols in the formulas for the IR and the architecture, they 
will be implicitly existentially quantified when checking for 
satisfiability, leading to spurious results as the solver can 
choose any interpretation. Hence, introduced function symbols 
must be universally quantified. Formally, let $Arch^{abs}$ and
$IR^{abs}$ be the abstract versions of $Arch$ and $IR$, respectively,
where the complex operations are removed and replaced with
a vector of function symbols $\mathbf{f}$. Then, building on (7), we get
the following formulation for the fully general rewrite rule
synthesis formula:

$$\exists\, inst, b^{in}, b^{out}.\ \forall \mathbf{c}, \mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{f}.\ IR^{abs}(\mathbf{c}, \mathbf{x}, \mathbf{f}) = b^{out}(inst(\mathbf{c}), Arch^{abs}(inst(\mathbf{c}), \mathbf{f}, b^{in}(inst(\mathbf{c}), \mathbf{x}, \mathbf{y}, \mathbf{z}))). \tag{8}$$

³ Technically, to maintain logical equivalence, $b^{in}$ and $b^{out}$ should also be
functions of $\mathbf{c}$, but for simplicity, we omit this, keeping the restrictions on
their form introduced above. We also did not find any additional dependency
on $\mathbf{c}$ to be needed in practice.
E. Rewrite Rule Synthesis Implementation


Here, we detail several additional considerations required
to solve the rewrite rule synthesis problem formalized above
in practice. Specifically, we discuss (i) removing second-order 
quantifiers; (ii) encoding instructions; (iii) formula optimiza- 
tions; and (iv) solving algorithm optimizations. 

Removing Second-Order Quantifiers. Note that $inst$, $b^{in}$,
$b^{out}$, and $\mathbf{f}$ are all quantified functions. In order to use an SMT
solver, we first need to find an equivalent formulation using
only first-order quantification. For the binding functions, this
is straightforward. Given the restrictions outlined above, there
are only a finite number of possible binding functions.⁴ Let $B$
be the set of all legal bindings $(b^{in}, b^{out})$. Then, formula (8)
is equivalent to

$$\exists\, inst.\ \forall \mathbf{c}, \mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{f}.\ \bigvee_{(b^{in}, b^{out}) \in B} IR^{abs}(\mathbf{c}, \mathbf{x}, \mathbf{f}) = b^{out}(inst(\mathbf{c}), Arch^{abs}(inst(\mathbf{c}), \mathbf{f}, b^{in}(inst(\mathbf{c}), \mathbf{x}, \mathbf{y}, \mathbf{z}))). \tag{9}$$

Unfortunately, just satisfying this formula does not tell us
which binding to use, so in practice, we also add an indicator
variable $i$, whose value indicates which binding was
used. Formally, we extend the notion of binding to a triple
$(b, b^{in}, b^{out})$, where $b$ is an integer unique to each binding.
Then, our formula becomes:

$$\exists\, inst, i.\ \forall \mathbf{c}, \mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{f}.\ \bigvee_{(b, b^{in}, b^{out}) \in B} i = b \land IR^{abs}(\mathbf{c}, \mathbf{x}, \mathbf{f}) = b^{out}(inst(\mathbf{c}), Arch^{abs}(inst(\mathbf{c}), \mathbf{f}, b^{in}(inst(\mathbf{c}), \mathbf{x}, \mathbf{y}, \mathbf{z}))). \tag{10}$$

To remove the quantification on $\mathbf{f}$, we can use Ackermannization
[3]. For each function symbol $f$, we replace each
instance of $f$ with a fresh variable of the same sort as the return
sort of $f$ and add constraints requiring that if the arguments
to any two of those instances of $f$ are equal, then the fresh
variables representing those instances are equal too. Assume,
for ease of presentation, that $\mathbf{f} = (f)$ and $f$ appears only
once in $IR^{abs}$, with arguments $\mathbf{s}$, and once in $Arch^{abs}$, with
arguments $\mathbf{t}$. Then, (10) is equivalent to⁵

$$\exists\, inst, i.\ \forall \mathbf{c}, \mathbf{x}, \mathbf{y}, \mathbf{z}, f_1, f_2.\ \bigvee_{(b, b^{in}, b^{out}) \in B} i = b \land \big((\mathbf{s} = \mathbf{t} \to f_1 = f_2) \to IR^{abs}(\mathbf{c}, \mathbf{x}, f_1) = b^{out}(inst(\mathbf{c}), Arch^{abs}(inst(\mathbf{c}), f_2, b^{in}(inst(\mathbf{c}), \mathbf{x}, \mathbf{y}, \mathbf{z})))\big). \tag{11}$$

Encoding Instructions. Above, we have assumed a simple
instruction model, where instructions are taken from some
sort $\tau$. In practice, an architecture may have a variety of
instructions, each with different components. This can be
modeled by letting $\tau$ be an algebraic data type (ADT), with
different constructors for each type of instruction. This also
solves the problem of how to handle $inst$ as a function of
$\mathbf{c}$. Some types of instructions allow immediate values to be
encoded as part of the instruction. For those instructions, we
allow a parameter from $\mathbf{c}$ to appear as the immediate value.
This is a very limited type of functional dependence on $\mathbf{c}$, but
it is sufficient for modeling the kinds of parametric rewrite
rules we are interested in.

⁴ To ensure finiteness, we limit the update operation mentioned above to
allow no more updates than there are IR inputs.

⁵ With some abuse of notation, if $P(f)$ is a formula containing $f$, and $f_1$ is
a variable whose sort matches the return sort of $f$, we write $P(f_1)$ to mean
the result of replacing the application of $f$ in $P$ by $f_1$.

To see how this works, consider as an example two formats 
from the RISC-V integer instruction set (RV32I): (i) R- 
type: register-register instructions; and (ii) I-type: register- 
immediate instructions. We can model these using the ADT: 


$$\begin{aligned}
INST = {} & RType(op : BV_{[7]},\ rd : BV_{[5]},\ func3 : BV_{[3]},\ rs1 : BV_{[5]},\ rs2 : BV_{[5]},\ func7 : BV_{[7]}) \mid {} \\
& IType(op : BV_{[7]},\ rd : BV_{[5]},\ func3 : BV_{[3]},\ rs1 : BV_{[5]},\ imm : BV_{[12]})
\end{aligned}$$

This could be further refined by declaring $op$, $func3$, etc. as
additional data types with limited sets of values. To handle
the dependence on parametric values, we add a constraint
stating that some immediate value is equal to a parameter.
For example, if we want to encode the case where the
immediate of $IType$ is a constant $c$, we add the constraint
$is\_IType(inst) \land imm(inst) = c$. To consider many possible
mappings of constants to immediates, we use a disjunction
over a set of possibilities as we do with bindings.
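
As an informal illustration of the ADT view of instructions, the two formats can be written as Python dataclasses, with `isinstance` playing the role of the tester and attribute access the role of the selectors. This is only a sketch: the class and function names are hypothetical, and the `encode` helper assumes the standard RV32I bit layout for R-type and I-type instructions, which is not part of the formalization above.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class RType:
    op: int      # BV[7]
    rd: int      # BV[5]
    func3: int   # BV[3]
    rs1: int     # BV[5]
    rs2: int     # BV[5]
    func7: int   # BV[7]

@dataclass(frozen=True)
class IType:
    op: int      # BV[7]
    rd: int      # BV[5]
    func3: int   # BV[3]
    rs1: int     # BV[5]
    imm: int     # BV[12]

Inst = Union[RType, IType]

def is_itype(inst: Inst) -> bool:
    """Tester is_IType."""
    return isinstance(inst, IType)

def immediate_constraint(inst: Inst, c: int) -> bool:
    """The constraint is_IType(inst) and imm(inst) = c."""
    return is_itype(inst) and inst.imm == (c & 0xFFF)

def addi(rd: int, rs1: int, c: int) -> IType:
    """inst(c): the parameter c appears as the immediate of an ADDI."""
    return IType(op=0b0010011, rd=rd, func3=0b000, rs1=rs1, imm=c & 0xFFF)

def encode(inst: Inst) -> int:
    """Pack the fields into a 32-bit word (standard RV32I layout)."""
    if is_itype(inst):
        return (inst.imm << 20 | inst.rs1 << 15 | inst.func3 << 12
                | inst.rd << 7 | inst.op)
    return (inst.func7 << 25 | inst.rs2 << 20 | inst.rs1 << 15
            | inst.func3 << 12 | inst.rd << 7 | inst.op)

print(hex(encode(addi(1, 2, 5))))  # 0x510093, i.e. addi x1, x2, 5
```

A parametric rule for an IR node that adds a constant $c$ to an input would then use `addi(rd, rs1, c)` as the instruction function $inst(c)$, with `immediate_constraint` tying the immediate field to the parameter.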


Formula Optimizations. 

For non-trivial designs, it is too expensive to repeat the 
architecture and IR formulas for every disjunct in the set of 
bindings. An alternative is to introduce additional variables 
for the inputs to and outputs from the architecture and to 
have the bindings operate only on those variables. For ease 
of presentation, let’s go back to formula (5) and write it as: 


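$$\exists\, inst.\ \forall \mathbf{x}, \mathbf{y}, \mathbf{z}.\ \bigvee_{(b^{in}, b^{out}) \in B} b^{out}(inst, Arch(inst, b^{in}(inst, \mathbf{x}, \mathbf{y}, \mathbf{z}))) = IR(\mathbf{x}). \tag{12}$$

This is equivalent to:

$$\exists\, inst.\ \forall \mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{u}, \mathbf{v}, \mathbf{w}.\ \bigvee_{(b^{in}, b^{out}) \in B} (b^{out}(inst, \mathbf{u}) = \mathbf{v} \land b^{in}(inst, \mathbf{x}, \mathbf{y}, \mathbf{z}) = \mathbf{w}) \to (Arch(inst, \mathbf{w}) = \mathbf{u} \land \mathbf{v} = IR(\mathbf{x})). \tag{13}$$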


In practice, it can also be inefficient to include memories 
and register files in the architecture. An alternative is to remove 
them and add an additional input for every read port and output 
for every write port. From the point of view of the rewrite rule 
synthesis, the problem is equivalent. This is the approach we 
take in our experiments. For example, the RISC-V register file,
which has the property that register 0 always holds 0, can be
modeled with two formulae, one for reads:

$$\text{let } r_1 = \begin{cases} 0 & \text{if } rs1 = 0 \\ v_1 & \text{otherwise} \end{cases},\quad r_2 = \begin{cases} 0 & \text{if } rs2 = 0 \\ v_1 & \text{if } rs1 = rs2 \ne 0 \\ v_2 & \text{otherwise} \end{cases} \text{ in } (r_1, r_2)$$

and one for writes: $(ite(rd = 0, s, v))$.

In the first formula, $v_1$ and $v_2$ are the values bound into
the register file (or more precisely added as inputs to the
architecture). $r_1$ and $r_2$ represent the values read from the
register file. $rs1$ and $rs2$ are the read addresses calculated by
the architecture from its instruction. Note that this is equivalent
to having two reads on an array without an intervening update.
However, it massively simplifies the task of generating $b^{in}$, as
we do not need to reason about how $rs1$ and $rs2$ will be
derived from $inst$.

In the second formula, rd is the write address, s represents 
the previous state of the written register, and v is the value to 
be written. Similar to the abstraction of reads, this significantly 
simplifies the generation of $b^{out}$. These simplifications are
possible as we do not care about the full state of the register 
file. We only care about the two indices which are read and 
the one index that is written. 
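
The register file abstraction can be compared against an explicit array model. The following sketch uses hypothetical function names and an arbitrary example register file; it checks that, when $v_1$ and $v_2$ are bound to the contents of the addressed registers, the read formula agrees with two array reads, and that the write formula agrees with an array update that leaves register 0 untouched.

```python
from itertools import product

def rf_read(rs1, rs2, v1, v2):
    """Read abstraction: register 0 reads as 0, equal addresses read equal values."""
    r1 = 0 if rs1 == 0 else v1
    if rs2 == 0:
        r2 = 0
    elif rs1 == rs2:
        r2 = v1
    else:
        r2 = v2
    return r1, r2

def rf_write(rd, s, v):
    """Write abstraction: ite(rd = 0, s, v), where s is the previous value."""
    return s if rd == 0 else v

# Reference model: an explicit 32-entry register file whose entry 0 is 0.
# The contents are arbitrary example values.
regs = [0] + [(7 * i + 3) & 0xFFFFFFFF for i in range(1, 32)]

reads_agree = all(
    rf_read(rs1, rs2, regs[rs1], regs[rs2]) == (regs[rs1], regs[rs2])
    for rs1, rs2 in product(range(32), repeat=2)
)

def array_write(rd, v):
    """Value stored at index rd after writing v (writes to x0 are dropped)."""
    updated = list(regs)
    if rd != 0:
        updated[rd] = v
    return updated[rd]

writes_agree = all(
    rf_write(rd, regs[rd], 0xABCD) == array_write(rd, 0xABCD)
    for rd in range(32)
)

print(reads_agree, writes_agree)  # True True
```

Only the two read indices and the one written index appear in the abstraction, which is why the full array never has to be modeled in the synthesis query.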


Solving Strategy. While some SMT solvers have support 
for quantified formulas, it is well-known that quantified for- 
mulas often lead to performance and robustness problems 
(and indeed, we observed this in preliminary experiments). 
We therefore adopt an external technique to solve the final 
quantified SMT queries, all of which are in exists-forall form: 


$$\exists a.\ \forall b.\ \phi(a, b) \tag{14}$$

Our technique is inspired by the counter-example guided
synthesis (CEGIS) [51] approach introduced in [15] and
more formally described in [27]. The algorithm consists of
alternating phases. It first suggests a solution for
$a$ by checking the satisfiability of $\phi(a, b)$. If $a_c$ is
the value found, it then checks whether this works for all
values of $b$ by checking the satisfiability of $\neg\phi(a_c, b)$. If this
is unsatisfiable, then $a_c$ is a solution for $a$ in (14). Otherwise,
let $b_c$ be the satisfying value found. We update $\phi$ to
be $\phi(a, b) \land \phi(a, b_c)$ and repeat. Essentially, we thus collect
many sample points $b_c$, with the hope that after enough are
collected, they will drive the search to find a value for $a$ that
satisfies (14). We found that in our setting of rewrite rule
synthesis, this approach works well.
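
The alternating structure of this loop can be sketched without an SMT solver by replacing both satisfiability checks with searches over small finite domains. This is an illustration under stated assumptions, not the paper's implementation: $a$ ranges over the four instruction values of the ALU from Section III-B (with unsigned division assumed as the fourth operation), $b$ ranges over pairs of 8-bit inputs, $\phi(a, b)$ states that instruction $a$ agrees with an 8-bit multiply IR on input $b$, and the loop is seeded with one arbitrary sample instead of an initial query over both $a$ and $b$.

```python
from itertools import product

MASK8 = 0xFF

def arch(inst, y1, y2):
    if inst == 0b00:
        return (y1 - y2) & MASK8
    if inst == 0b01:
        return (y1 + y2) & MASK8
    if inst == 0b10:
        return (y1 * y2) & MASK8
    return y1 // y2 if y2 else 0   # assumed fourth operation

def ir(x1, x2):
    """IR program: 8-bit multiplication."""
    return (x1 * x2) & MASK8

def phi(a, b):
    """phi(a, b): instruction a agrees with the IR on the input pair b."""
    return arch(a, *b) == ir(*b)

A_DOMAIN = range(4)
B_DOMAIN = list(product(range(256), repeat=2))

def find_candidate(samples):
    """Synthesis phase: some a with phi(a, b_c) for every collected sample."""
    for a in A_DOMAIN:
        if all(phi(a, b) for b in samples):
            return a
    return None

def find_counterexample(a):
    """Verification phase: some b with not phi(a, b), or None if none exists."""
    return next((b for b in B_DOMAIN if not phi(a, b)), None)

def cegis():
    samples = [B_DOMAIN[0]]            # seed with one arbitrary sample
    while True:
        a = find_candidate(samples)
        if a is None:
            return None, samples       # no instruction fits the samples
        b = find_counterexample(a)
        if b is None:
            return a, samples          # a works for every b: solved
        samples.append(b)              # refine with the counterexample

solution, samples = cegis()
print(solution, samples)  # 2 [(0, 0), (0, 1)]
```

In this run a single counterexample, the input (0, 1), is enough to rule out both subtraction and addition, after which the multiply instruction is proposed and verified.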


IV. EVALUATION 


We evaluate the above approach for rewrite rule synthesis 
by showing the ability to efficiently synthesize rewrite rules in 
two settings. First, we synthesize rewrite rules from the CoreIR
intermediate representation to different CGRA processing
elements and, second, from the WebAssembly intermediate
representation to RISC-V with extensions.
We implement the architectures in the Magma hardware
description language [1], [55]. We chose Magma as it has 
first-class support for formal analysis through its associated
“hwtypes” library [37], whose semantics match those of the 
SMT-LIB theory of bitvectors. We construct an SMT formula 
for the architecture by tracing the inputs of the circuit and 
the outputs of the architectural state to the outputs of the 
circuit and the inputs of its architectural state. While Magma 
is convenient, it is not essential; any HDL could be used to 
generate a formal model. We specify IRs directly in SMT 
using pysmt [27] and use Boolector [43] as the SMT solver. 
Additionally, we implement minimal compilers which apply 
the synthesized rules in order to compare to existing hand- 
coded tools. Details of our full experimental setup and more 
results can be found in the appendix. 


A. Rewrite Rules for CGRAs 


Our first case study targets CGRAs, a style of spatial
architecture similar to FPGAs which have been of increasing 
interest to both academia and industry. CGRAs differ from 
FPGAs by employing larger processing elements (PEs) instead 
of lookup tables (LUTs). Further, CGRAs typically have more 
restricted word-level routing networks rather than bit-level 
routing networks [39]. We evaluate our ability to synthesize 
rewrite rules for such architectures by synthesizing rewrite 
rules from CoreIR to four different PEs. We chose CoreIR 
as a source IR as it is formally specified [16], [40]. 

1) CGRA Processing Element Implementation: We use four 
versions (PE-A, PE-B, PE-C, PE-F) of an internally developed 
16-bit processing element. PE-A contains a two-input ALU 
that can perform bit-wise operations, comparisons, shifts, 
addition, and multiplication, along with a lookup table for 
Boolean operations. Each ALU input can be driven by an 
external signal or a local immediate constant. PE-F adds 
16-bit floating point (bfloat16) addition and multiplication 
to PE-A. We then extend PE-A with operations commonly 
occurring in image processing applications. PE-B extends 
PE-A with absolute difference (|x - y|), and PE-C extends 
PE-B with fused multiply-add with an immediate constant 
(x × const + y). Generating such a collection of similar 
architectures is a common practice when doing design space 
exploration. Our synthesis method, combined with a tool such 
as VTR [42] to perform place and route, could enable a 
designer to evaluate a large design space on real benchmarks. 

2) Rewrite Rule Synthesis: We evaluate our ability to 
synthesize rewrite rules for CoreIR’s 16-bit integer instructions 
(i16), Boolean instructions (i1), and floating point instruc- 
tions using Bfloat16 [17] (bfloat16). 

The times to derive these rewrite rules are shown in Figure 1.
Note that while most CoreIR operations can be mapped
to the base PE, some can only be mapped to one or more of
the variants. Each rule for the integer PEs can be found within
1.1 seconds. Additionally, the floating-point instructions can
be found for PE-F within 1.2 seconds.

Fig. 1: The median time over 10 runs needed to derive a
rewrite rule for various CoreIR operations to different PE
architectures.

In Table II, we show the total time in seconds spent
synthesizing rewrite rules (a SAT result) or proving that no
rewrite exists (an UNSAT result, potentially due to the lack
of a matching abstraction) for each PE design. Targeting the
integer PEs is extremely fast, taking less than 12 seconds per
design to generate a full set of rewrite rules. The process is
"slow" for PE-F, requiring about 3.5 minutes. However, this
time is trivial compared to the time it would take to manually
write these rules.

          | PE-A | PE-B  | PE-C  | PE-F
UNSAT (s) | 0.81 | 0.74  | 0.34  | 118.09
SAT (s)   | 8.63 | 10.15 | 11.06 | 91.49
Total (s) | 9.44 | 10.88 | 11.40 | 209.58

TABLE II: Total time generating SAT results and UNSAT
results, for each PE design.


B. Rewrite Rules for RISC-V 


Our second case study shows how our technology can be 
used to synthesize rewrite rules from WebAssembly targeting 
RISC-V processors. WebAssembly is an intermediate repre- 
sentation designed to be a target for web applications. The 
IR itself has formally-defined semantics for each operation, 
making it suitable for our method. 

We extract the post-instruction-fetch portion of the proces- 
sor in order to give it the appearance of having an instruction 
input. Further, we replace the register file with the simpli- 
fied model described in Section I-E. These transformations 
require only a handful of lines of boilerplate Python for 
each architecture. Additionally, we construct specifications 
of instruction formats as ADTs and provide any necessary 
annotations for the register file (i.e., which registers have 
special semantics, like register 0 in RISC-V). 

1) RISC-V Implementation: In addition to implementing a 
processor for the base RV32I ISA, we implement processors 
for the RV32IM and RV32IF standards. The "M" extension 
adds instructions for multiplication, division, and remainder. 
The "F" extension adds support for floating point operations. 
Full details can be found in the RISC-V manual [2]. In addition 
to these standard extensions, we define our own extension 
RV32X, which adds common bit-counting operations, which 
are defined in WebAssembly. Specifically: count-leading-zeros
(i32.clz), count-trailing-zeros (i32.ctz), and population
count (i32.popcnt).

Instruction | RV32I | RV32IM | RV32IX | RV32IF
i20.const   | 0.3   | 10.4   | 1.8    | 4.2
i32.le_s    | 2.2   | 27.3   | 3.7    | 80.1
i32.ge_s    | 1.6   | 30.8   | 4.5    | 71.7
i32.le_u    | 1.6   | 25.7   | 4.7    | 75.1
i32.ge_u    | 2.4   | 18.1   | 2.2    | 51.2
i32.eq      | 2.1   | 23.5   | 3.3    | 22.3
i32.ne      | 2.2   | 6.4    | 1.2    | 9.9

TABLE III: Median SMT performance in seconds for
synthesizing two sequential instructions for i20.const and
comparison instructions.

2) Rewrite Rule Synthesis: We evaluate our ability to 
synthesize rewrite rules for WebAssembly’s 32-bit integer 
instructions (i32) and a subset of floating point instructions 
(float). The integer instructions also include pop-count, 
count-leading-zeros, and count-trailing-zeros. 

Fig. 2: Time needed to synthesize a single RISC-V instruction 
for each RISC-V Architecture. SAT means a rewrite rule 
was discovered. UNSAT means there is provably no single 
instruction rewrite that is possible. Reported times are the 
median result over 10 runs. 


In Figure 2, either the time to synthesize a rewrite rule (SAT) 
or the time to prove that a rewrite rule does not exist (UNSAT) 
is shown for each IR instruction. Synthesis for RV32I succeeds 
in finding all instructions executable as a single instruction on 
the target architecture. For the integer processors, all rules 
are discovered within 4.1 seconds, with most only taking a 
few hundred milliseconds. Proving that rewrite rules do not 
exist is also possible within 4.1 seconds. For RV32IF, all 
the rules are found within 22 seconds, with most taking less 
than 8 seconds. Proving that particular rules like i32.rem_s 
are not possible takes up to 38 seconds. RV32IF contains 
many floating point instances, each requiring an expensive new 
universally quantified variable (explained in Section III-E). 
This can mostly explain the higher time compared to the other 
architectures. 

[Figure 3: bar chart titled "Time to Generate All Rewrite Rules"; vertical axis: Time (minutes); legend: SAT (2 Instructions), SAT (1 Instruction), UNSAT]

Fig. 3: Total time to generate single instruction SAT results,
2 instruction SAT results, and UNSAT results of 37 rewrite
rules for each RISC-V architecture.

Some comparison instructions are impossible to implement
in a single instruction (a fact verified by our method), so
we searched for sequences of two instructions, by composing
two architectures as described in Section II-B. The times
to find rewrites for these comparison operations for each of
the RISC-V architectures are shown in Table III. We are
able to synthesize these rules for RV32I and RV32IX in a few
seconds. For RV32IM and RV32IF, which are significantly
more complex circuits, synthesis times are under 31 and 81
seconds, respectively. We note that verifying a rewrite rule
can be done nearly instantaneously (well under a second for
any rule we discovered). Therefore, given the knowledge that
RV32I is a subset of RV32IF, one could simply verify that
rules generated for RV32I work for RV32IF in order to avoid
the longer synthesis times which arise from the complexity of
floating point.
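
Verifying an already-known rule amounts to a single equivalence check between the IR operation and the target instruction sequence. The Python sketch below illustrates this for a signed less-than-or-equal comparison mapped to two instructions. It is written under stated assumptions: the two-instruction sequence is a well-known RISC-V idiom chosen for illustration (the rules actually synthesized are not listed here), the check is exhaustive at a reduced bit-width instead of an SMT query over the RTL, and all function names are hypothetical.

```python
WIDTH = 6                      # reduced width so exhaustive checking is cheap
MASK = (1 << WIDTH) - 1


def to_signed(x):
    """Interpret a WIDTH-bit pattern as a two's-complement integer."""
    return x - (1 << WIDTH) if x & (1 << (WIDTH - 1)) else x


def ir_le_s(x, y):
    """Source IR semantics: signed less-than-or-equal, result 0 or 1."""
    return int(to_signed(x) <= to_signed(y))


def slt(x, y):
    """Target instruction: set-less-than (signed)."""
    return int(to_signed(x) < to_signed(y))


def xori(x, imm):
    """Target instruction: exclusive-or with an immediate."""
    return (x ^ imm) & MASK


def candidate_rule(x, y):
    """Two-instruction sequence: slt t, y, x ; xori t, t, 1."""
    return xori(slt(y, x), 1)


def verify(ir_op, target):
    """Check that both sides agree on every pair of WIDTH-bit inputs."""
    return all(ir_op(x, y) == target(x, y)
               for x in range(1 << WIDTH) for y in range(1 << WIDTH))


print(verify(ir_le_s, candidate_rule))
```

The same `verify` call could be pointed at a model of a larger architecture, which is the reuse suggested above for rules found on RV32I.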

Similar to Table II, in Figure 3 we show the time spent 
synthesizing rewrite rules or proving no rewrite rule exists for 
each RISC-V architecture. This includes the time for proving 
that the instructions in Table III cannot be accomplished in one 
instruction, and the time for synthesizing each two-instruction 
rule. Results from targeting RV32I and RV32IX are fast, 
each taking less than a minute. Results targeting RV32IM 
and RV32IF are slower at around 3 minutes and 9 minutes, 
respectively, but this is still significantly faster than manually 
writing these rules. 


V. RELATED WORK 


In recent years, many new techniques and tools have been 
developed for synthesis based on SAT and SMT solving [30], 
[31], [34], [50], [52], [51], [53]. In the SKETCH language, 
for example, a programmer provides a specification and a 
partial program with “holes” [52], [51]. SKETCH attempts 
to fill these holes so that the complete program matches the 
specification. However, due to the nuances of targeting RTL, 
we found that a direct encoding into SMT formulas was 
more flexible and convenient than using an existing program 
synthesis system. One promising approach is Syntax-guided 
synthesis (SyGuS) [5], [6], [48], in which a program must 
be synthesized within a given grammar to meet a given 
specification (the grammar and specification are given using a 
variant of the SMT-LIB language [8]). Exploring possible uses 
of SyGuS in this context is an interesting avenue for future 
work. 

Perhaps more relevant is the work of Dias and Ramsey [18], 
[19], [49], who, in their 2006 work, propose a system to 
synthesize rewrite rules using an ISA specification where 
each instruction is specified as a distinct formula. They use 
a pattern-matched syntax tree to synthesize these rules. In 
contrast, we use SMT to find all equivalences. Further, we 
use the RTL directly rather than a manually specified enumer- 
ation of instructions. This distinction is especially important 
during design space exploration, when automating as much as 
possible is crucial. 

More recently, Buchwald, Fried, and Hack proposed a 
system which, like the work of Dias and Ramsey, synthe- 
sizes rewrite rules using an enumeration of an ISA’s instruc- 
tions [12]. However, instead of using pattern matching they 
leverage SMT to find rewrite rules for integer instructions. 
They notably lack support for floating-point, which we can 
handle efficiently. One interesting contribution of their work is 
the ability to synthesize control flow instructions by modeling 
them as a set of Boolean functions which indicate which 
branch target was taken. Applying a similar method in our 
approach is an interesting avenue for future work. 


VI. DISCUSSION AND FUTURE WORK 


Our technique for rewrite rule synthesis is a step towards 
automatically synthesizing a complete code generator from 
an RTL description of the target architecture. Future work 
includes two directions: synthesizing more kinds of rewrite 
rules, and targeting more expressive RTL. Pipelined architec- 
tures could leverage unpipelining [38] or unrolling [11] (with 
side conditions to ensure progress) to generate a model with 
the desired properties. Alternatively, if the RTL is derived from 
a high-level language, we could capture the synthesized design 
before micro-architectural details are added. 

Architects often explore many alternatives when designing 
new hardware. This is often done incrementally. They propose 
a design change, implement it, then reevaluate the efficiency. A 
major impediment to design space exploration is implementing 
the software changes needed to compile the application to the 
new accelerator. The work in this paper enables automatically 
deriving part of the code generator and is one step towards the 
goal of eventually building a complete system for rapid and 
automated design space exploration. 


APPENDIX 


In addition to showcasing the efficiency of generating in- 
struction rewrite rules, we wrote two compilers, one targeting 
the CGRA PEs, and one targeting the RISC-V processors, 
that demonstrate that the synthesized rewrite rules can be used in 
real application compilation. The CGRA rewrite rule synthesis 
and compiler are actively being used in production in our lab’s 
efforts to design and run applications on our CGRA [7]. 

1) CGRA Compilation Results: We apply our synthesized 
rewrite rules to a number of image processing applications 
written in the domain-specific language Halide [47]. Halide 
is generally amenable to hardware acceleration [46], [57], 
making it a suitable source language to target our PE designs. 
The standard Halide compiler first lowers the program to its 
internal IR consisting of multiple computational kernels and 
structured for-loops [47]. Each kernel is further lowered to 
a dependency graph of CoreIR instructions. Our instruction 
selector then applies the synthesized rewrite rules to transform 
each kernel into a graph of PE instructions. 

We selected four typical image processing applications: (1) 
Gaussian blur, an algorithm for blurring an image using con- 
volution with a Gaussian kernel [44]; (2) A bfloat16 version of 
Gaussian blur; (3) Harris corner detection, which finds sharp 
corners of objects in images [32]; and (4) A complete camera 
pipeline, which is representative of end-to-end processing of 
raw sensor data to a final image [35]. The camera pipeline 
includes kernels for hot pixel suppression, demosaicing, and 
color correction. These applications have 3, 3, 10, and 14 
distinct kernels, respectively, and the number of operations 
within a kernel range from just a single operation to almost 
200. 

We compare our synthesized rewrite rules to an existing 
hand-coded set for PE-F. Table IV shows the instruction counts 
for each application with each set of rewrite rules. The code 
sizes for the synthesized rewrite rules are the same or better 
than the sizes of those from the hand-coded rules. For Harris, 
which contains the i16.umin and i16.smin operations, 
the hand-coded result uses 2 instructions¹ (i16.lte and 
i16.mux), but since we synthesize rewrite rules directly for 
i16.umin and i16.smin, the instruction selector produces 
more efficient code. Additionally, our instruction selector 
works for the new PE variants automatically. It uses the 
i16.absd instruction in both PE-B and PE-C, reducing 
the total instructions for Camera. Similarly, it leverages the 
i16.const-fma instruction in PE-C, greatly reducing the 
total instructions for Gaussian and slightly reducing them for 
Camera. This example demonstrates that it is easy to extend 
PEs with new instructions and automatically update the set of 
valid rewrite rules. 

¹ We are unsure at this point whether this is a result of the architecture being
updated without a corresponding update to the hand-coded rewrite rules or
whether this rule was just overlooked by the original author of the tool. In
any event this sort of mistake motivates this paper.


                  | Hand-coded | Synthesized
Application       | PE-F       | PE-F | PE-A | PE-B | PE-C
Gaussian i16      | 20         | 20   | 20   | 20   | 12
Gaussian bfloat16 | 20         | 20   | N/A  | N/A  | N/A
Harris            | 116        | 109  | 109  | 108  | 108
Camera            | 343        | 338  | 338  | 309  | 308

TABLE IV: The number of PE instructions required for
four Halide applications: Camera, Gaussian integer, Gaussian
bfloat16, and Harris. The applications are compiled using both
the hand-coded rewrite rules and the synthesized ones.


2) RISC-V Compilation Results: We also show that we can
compile branch-free C programs using our synthesized rewrite
rules. This approach has been used to evaluate other code
generators [50], [30]. Specifically, we compile 25 Hacker's
Delight [56] programs. We use C implementations from
Gulwani et al. [30].


Benchmark | Synthesized | gcc -O0 | gcc -O1
P1        | 3           | 16      | 3
P2        | 3           | 16      | 3
P3        | 3           | 16      | 3
P4        | 3           | 16      | 3
P5        | 3           | 16      | 3
P6        | 3           | 16      | 3
P7        | 5           | 19      | 4
P8        | 5           | 19      | 4
P9        | 4           | 20      | 4
P10       | 4           | 24      | 5
P11       | 4           | 22      | 4
P12       | 5           | 23      | 5
P13       | 5           | 22      | 4
P14       | 5           | 25      | 5
P15       | 5           | 23      | 5
P16       | 10          | 29      | 6
P17       | 6           | 23      | 5
P18       | 4*          | 36      | 7
P19       | 6           | 35      | J:
P20       | 9           | 35      | 8
P21       | 25          | 50      | 13
P22       | 26          | 39      | 11
P23       | 32          | 50      | 15
P24       | 18          | 50      | 12
P25       | 27          | 72      | 19

TABLE V: Number of RISC-V RV32IM instructions on 25
Hacker's Delight programs (P1-P25). We show our system
versus gcc with two levels of optimization. *The compilation
of P18 to WebAssembly generated an i32.popcnt and hence
could only be compiled to RV32IX.

We compile C to stack-machine WebAssembly byte code 
using Emscripten [58] (using emcc -Os). We then transform 
the resulting code into a basic block by abstract interpreta- 
tion on a virtual stack [22], implemented with a modified 
WebAssembly interpreter. 
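
The idea of running stack code on a virtual stack of symbolic names can be sketched as follows. This is a minimal illustration, not the modified interpreter used here: the instruction tuples are a hypothetical encoding, only `local.get`, `i32.const`, and binary operations are handled, and the function name `stack_to_block` is an assumption.

```python
def stack_to_block(code):
    """Turn straight-line stack code into a list of three-address operations."""
    stack, block = [], []
    for op, *args in code:
        if op == "local.get":
            stack.append(f"arg{args[0]}")      # push the name of an input
        elif op == "i32.const":
            stack.append(str(args[0]))         # push a literal
        else:
            # Binary operation: pop two symbolic operands, emit one
            # operation, and push the name of its result.
            rhs = stack.pop()
            lhs = stack.pop()
            tmp = f"t{len(block)}"
            block.append((tmp, op, lhs, rhs))
            stack.append(tmp)
    return block, stack.pop()


# x & (x - 1): clears the rightmost set bit of x.
code = [
    ("local.get", 0),
    ("local.get", 0),
    ("i32.const", 1),
    ("i32.sub",),
    ("i32.and",),
]
block, result = stack_to_block(code)
for line in block:
    print(line)
```

The resulting block names every intermediate value explicitly, which is the form an instruction selector applying rewrite rules can work on.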

We apply type legalization [36] to decompose i32 con- 
stants into i12 and i20 constants. These bit-widths are 
chosen as they are the bit-width of immediate fields in the 
RISC-V ISA. Instruction selection is then applied using the 
synthesized rewrite rules. Next, we perform basic instruction 
scheduling and register allocation, and finally we assemble the 
instructions into RISC-V byte code [28], [14]. 
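
As an illustration of the constant decomposition, the sketch below splits a 32-bit constant into a 20-bit upper part and a 12-bit lower part in the way the standard RISC-V `lui`/`addi` pair expects, where the 12-bit immediate is sign-extended. Whether the compiler described here uses exactly this split is an assumption; the helper names are hypothetical.

```python
def split_const(c):
    """Split a 32-bit constant into (hi20, lo12) with lo12 sign-extended."""
    c &= 0xFFFFFFFF
    lo = c & 0xFFF
    if lo >= 0x800:          # addi sign-extends its 12-bit immediate
        lo -= 0x1000
    hi = ((c - lo) >> 12) & 0xFFFFF
    return hi, lo


def recombine(hi, lo):
    """Model of lui rd, hi ; addi rd, rd, lo on a 32-bit register."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


for value in (0, 1, 0x7FF, 0x800, 0x12345FFF, 0xFFFFFFFF):
    hi, lo = split_const(value)
    print(hex(value), hex(hi), lo, recombine(hi, lo) == value)
```

The adjustment of the upper part when the lower part is negative is the step that is easy to get wrong by hand.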


We compare the code we generate to that produced 
by gcc (riscv64-unknown-elf-gcc -march=rv32g
-mabi=ilp32). The gcc -O0 option uses the stack to store
intermediates so our code size is better, while gcc -O1 uses
multiple basic blocks to decrease code size, which we do not
support.

Table V shows a comparison of the number of instructions 
generated from our compiler versus a RISC-V gcc compiler 
for each Hacker's Delight program. For P21-P25, gcc -O1
generates smaller code by using branching code, an
optimization we do not implement. P18 uses an i32.popcnt
in the generated WebAssembly. When targeting RV32IX we 
can leverage the custom instructions to compile program P18 
using only 4 instructions. 


Abstract—Error-correction codes (ECCs) are becoming a de 
rigueur feature in modern memory subsystems, as it becomes 
increasingly important to safeguard data against random bit 
corruption. ECC architecture constantly evolves towards designs 
that leverage complex mathematics to minimize check-bits and 
maximize the number of data bits protected, as a result of which 
subtle bugs may be introduced into the design. These algorithms 
traverse a vast data space and are subject to corner case bugs 
which are hard to catch through constraint-based randomized 
testing. This necessitates formal verification of ECC designs to 
assure correctness of the algorithm and its hardware implementa- 
tion. In this paper we present a technique of representing various 
ECC algorithm outputs as Boolean equations in the form of 
Boolean Decision Diagrams (BDDs) to facilitate reasoning about 
the algorithms. We also discuss the counting and generation of 
examples from the BDD representations and how it aids in tuning 
ECC algorithms for performance and security. Additionally, we 
display the use of Symbolic Trajectory Evaluation (STE) to prove 
the correctness of register transfer level (RTL) implementations 
of these algorithms. We discuss the scaling up of this verification 
methodology, using different complexity and convergence tech- 
niques. We apply these techniques to a number of complex ECC 
designs at Intel and showcase their efficacy on several categories 
of bugs. 

Index Terms—error correction codes, formal verification, sym- 
bolic simulation, binary decision diagrams 


Mihir Parang Mehta
FVCTO¹
Intel Corporation
Santa Clara, CA, USA
mihirl.mehta@intel.com


Vaibhav Singh
FVCTO¹
Intel Corporation
Portland, OR, USA
vaibhav.singh@intel.com


¹ Formal Verification Central Technical Office


Intel provides these materials as-is, with no express or implied warranties.
Intel processors might contain design defects or errors known as errata, which
might cause the product to deviate from published specifications. Intel and
the Intel logo are trademarks of Intel Corporation. Other names and brands
might be claimed as the property of others.


https://doi.org/10.34727/2022/isbn.978-3-85448-053-2_21


I. INTRODUCTION


With the ever-increasing capacity demands, memories are
becoming denser and are more susceptible to soft errors. Error
Correction Codes (ECCs) provide resiliency to the memory
cell against errors due to cosmic rays, impurities during
manufacturing, and other causes. Recent moves by chip
manufacturers to extend ECC support to consumer processors, which
was once limited to servers, emphasize the universal necessity
of ECCs. If the ECC fails, it will result in incorrect data getting
read; in a safety-critical system, this can be catastrophic. ECC
designs work by carefully adding data redundancy in the form
of some check-bits to the data-stream while storing it. These
check-bits and data-bits, which may have been corrupted
during storage, are then used together to retrieve the original
data. Though helpful in providing memory-protection, ECC 
designs are difficult to verify. ECC verification can be a 
challenge both for dynamic validation (DV) from the coverage 
perspective, and for formal verification (FV) from the con- 
vergence perspective. Consider the example of a Triple Error 
Correction Quadruple Error Detection (TECQED) design with 
512 data-bits, 1-bit Metadata and 31 check-bits. Pre-silicon 
dynamic validation would require 4.87e163 input patterns to 
fully validate the design, a nearly impossible task, and post- 
silicon issues are discovered very late in the design cycle, not 
providing enough time to determine a robust fix. Owing to the 
complex equations generally used in ECC logic, these designs 
are not tractable by different industry standard FV tools. Most 
commercial model-checking tools are better suited to solve 
control path challenges and falter in achieving convergence on 
big datapath designs. Commercial datapath FV tools tend to 
rely on structural similarities of the reference specification and 
the implementation. Such similarities are absent in the case of 
closed-box ECC verification, where the specification is just a 
property stating, “the resultant data equals the received data”. 
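
The size of the dynamic-validation input space can be reproduced with a short calculation. The decomposition below is one reading that happens to match the quoted figure, not something the text states: it assumes $2^{512}$ data values multiplied by the number of ways to corrupt at most four of the 544 codeword bits (512 data, 1 metadata, 31 check-bits).

```python
from math import comb

DATA_BITS = 512
CODEWORD_BITS = 512 + 1 + 31   # data + metadata + check-bits = 544
MAX_ERRORS = 4                 # TECQED: correct up to 3, detect 4

# Number of corruption patterns with 0, 1, 2, 3, or 4 flipped bits.
error_patterns = sum(comb(CODEWORD_BITS, k) for k in range(MAX_ERRORS + 1))

# Every data value combined with every such corruption pattern.
total_patterns = 2 ** DATA_BITS * error_patterns

print(f"{total_patterns:.2e}")  # prints 4.87e+163
```

Under this reading, almost all of the magnitude comes from the data space itself, which is why symbolic treatment of the data inputs matters more than the enumeration of corruption patterns.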

This paper shows our results in verifying diverse ECC 
algorithms and designs, across a range of datacenter and 
consumer processors, using an Intel-internal datapath tool, 
Forte/rSTE [3], [12]. The complexity of these verification tasks 
varied from a 64-bit corruption on a Dynamic Random-Access 
Memory (DRAM) device in a memory controller to a 512b- 
sized TECQED ECC in a data cache. We analyze our results 
with respect to different verification parameters (complexity, 
coverage, runtimes etc.) and compare with commercial tools. 

In the remainder of this paper, we briefly introduce error 
correction (section II), and the underlying proof methodology 
with the Forte tool (section III). We explain the verification 
setup, and the properties we prove (section IV) on ECCs. We 
evaluate the results of these verification activities (section V) 
and sum up our contributions (section VII). 

Fig. 1: ECC Writer and Reader 


II. ERROR CORRECTION CODES 


ECC functionality is usually implemented in hardware 
designs as two modules, a Writer and a Reader (Fig. 1). 
Using a generator function $g$, the Writer generates, from data
$D_w$, a codeword $CW_w$, which consists of $D_w$ appended with
check-bits. There are two types of check-bits, locator bits $L_w$
and parity bits $P_w$. Once $CW_w$ is written to memory, it is
subject to zero or more bits of corruption. Within the Reader,
the extractor function $h$ computes the locator syndrome $LS$
and parity syndrome $PS$. These syndromes are calculated
by re-computing the locator bits $L'_r$ and the parity bits $P'_r$
and comparing them to their values $L_r$ and $P_r$ in the read
codeword $CW_r$. Using these syndromes, the function $f$ can
determine the presence and the location of the error,
finally returning corrected data $D'$ and error signal
$Err$. $Err$ can be:


1) No Error (NE): No corruption detected in $CW_r$.

2) Correctable Error (CE): Corruption detected in $CW_r$ and
fixed. Thus, output data $D' = D_w$.

3) Detectable but Uncorrectable Error (DUE): $CW_r$
corruption detected, but correction is outside the algorithm's
capabilities. Thus, $D' \neq D_w$.
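
The Writer/Reader split and the three outcomes can be illustrated with a textbook extended Hamming code protecting 4 data bits with 4 check-bits. This is a generic sketch, not one of the designs verified in this paper: the three Hamming check-bits play roughly the role of locator bits, the overall parity bit plays the role of the parity bit, and the names `write`, `read`, and `DATA_POS` are hypothetical.

```python
from functools import reduce
from operator import xor

DATA_POS = (3, 5, 6, 7)  # codeword positions that hold the data bits


def write(data):
    """Writer: build an 8-bit codeword from 4 data bits."""
    cw = [0] * 8
    for pos, bit in zip(DATA_POS, data):
        cw[pos] = bit
    cw[1] = cw[3] ^ cw[5] ^ cw[7]      # check-bit covering positions 1,3,5,7
    cw[2] = cw[3] ^ cw[6] ^ cw[7]      # check-bit covering positions 2,3,6,7
    cw[4] = cw[5] ^ cw[6] ^ cw[7]      # check-bit covering positions 4,5,6,7
    cw[0] = reduce(xor, cw[1:])        # overall parity over positions 1..7
    return cw


def read(cw):
    """Reader: return (data, status) with status NE, CE, or DUE."""
    cw = list(cw)
    # XOR of the positions of all set bits; nonzero value locates a single error.
    syndrome = reduce(xor, (i for i in range(1, 8) if cw[i]), 0)
    overall = reduce(xor, cw)          # 1 when an odd number of bits flipped
    if syndrome == 0 and overall == 0:
        status = "NE"
    elif overall == 1:
        cw[syndrome] ^= 1              # syndrome 0 means the overall parity bit flipped
        status = "CE"
    else:
        status = "DUE"                 # even number of flips with nonzero syndrome
    return [cw[p] for p in DATA_POS], status


codeword = write([1, 0, 1, 1])
codeword[6] ^= 1                       # corrupt one stored bit
print(read(codeword))
```

Flipping a second bit of the same codeword makes the reader report DUE instead of attempting a correction.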

Reliability, Availability, and Serviceability (RAS), feature sets 
that are associated with system resiliency in the presence of 
hardware faults, impose requirements that vary across designs. 
For example, some SRAM (Static Random-Access Memory) 
cache designs may need protection from bit-flips that can 
randomly happen at any bit-position in the cache-line, while 
other designs, such as DRAM, may need protection on groups 
of neighboring bits, which we will refer to as bit-groups. RAS
requirements shape the choice of the ECC algorithm. These
algorithms are based on the mathematical theories of Galois
Extensions (Bose-Chaudhuri-Hocquenghem Codes) [4], [10],
Lagrange Interpolation (Reed-Solomon Codes) [15], and finite
fields.


III. SYMBOLIC SIMULATION AND FORTE TOOLSET 


Symbolic simulation extends standard digital circuit 
simulation with symbolic representations of values, covering 
behaviors of a circuit for all possible instantiations of the 
symbolic values in a single simulation. Used as a formal 
verification method, symbolic simulation is algorithmically 
simple and intuitive, which enables precise analysis and fine- 
grained mitigation of computational complexity, allowing the 
method to handle circuits that are above the capacity of 
standard formal model checking tools. Symbolic simulation 
excels in verification of deep targeted properties of fixed- 
length pipelines, in particular arithmetic and other datapath 
circuits. It has been the main vehicle for Intel arithmetic formal 
verification for over twenty years, and most arithmetic execu- 
tion units of Intel processor designs have been exhaustively 
verified using it [3], [12]. It is the primary engine embedded 
in Intel’s proprietary Forte/rSTE toolset. Symbolic simulation 
was first applied to ECC verification in 2005. Gradually, this 
application found its place in Server Memory Controller ECC 
(MC ECC) verification arsenal. 

In a symbolic simulator the input stimulus may contain 
symbolic variables in addition to the concrete Boolean values 
0, 1 and X. These symbolic variables are names of values, 
denoting sets of concrete values. The values of the internal 
signals computed in the simulation are then structural logical 
expressions on the symbolic variables on the inputs. For 
example, in a bit-level symbolic simulator, a single symbolic 
variable a corresponds to the set of Boolean values consisting 
of both 0 and 1, and if stimulus to a symbolic simulation trace 
contains the variables a, b, and c, the internal signals might 
carry values like a&b or a+(b&c). The symbolic expressions 
in a simulation are commonly encoded using Binary Decision 
Diagrams (BDDs) [5]. 
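
The principle can be illustrated with a toy symbolic simulator in Python. As a loudly stated simplification, each symbolic value is stored as a truth table packed into an integer instead of a BDD; like a BDD, this is a canonical form, so two signals are equivalent exactly when their tables are equal. The class `Sym`, the helper `var`, and the example circuit are hypothetical.

```python
N_VARS = 3
ROWS = 1 << N_VARS            # number of assignments to the input variables
FULL = (1 << ROWS) - 1        # truth table of the constant 1


class Sym:
    """Boolean function of N_VARS variables, stored as a truth-table bitmask."""

    def __init__(self, table):
        self.table = table & FULL

    def __and__(self, other):
        return Sym(self.table & other.table)

    def __or__(self, other):
        return Sym(self.table | other.table)

    def __invert__(self):
        return Sym(~self.table)

    def __eq__(self, other):
        return self.table == other.table


def var(i):
    """Symbolic variable i: true in the rows where bit i of the row index is 1."""
    return Sym(sum(1 << row for row in range(ROWS) if (row >> i) & 1))


def circuit(x, y, z):
    """A small gate network, simulated once for all input values."""
    return (x & y) | (x & z)


a, b, c = var(0), var(1), var(2)
out = circuit(a, b, c)
print(out == (a & (b | c)))   # one simulation run covers every assignment
```

Truth tables grow exponentially with the number of variables, which is the capacity problem that BDDs often avoid in practice.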

The limits of computational capacity are the limits between 
what can and cannot be verified in practice. When attempting 
to resolve a capacity challenge, the crucial difference between 
symbolic simulation and other formal verification methods is 
that in symbolic simulation a capacity problem is extremely 
concrete. It manifests itself as a symbolic expression (BDD) 
that is too large, associated with a particular node and time 
in the simulation. This concreteness allows a user to analyze, 
understand and resolve the problem with a greater degree of 
precision than other methods of verification. This amenability 
to precise performance analysis is a key differentiator en- 
abling the success of symbolic simulation. Direct user-level 
access to BDDs also allows advanced complexity management 
techniques, such as parametric substitutions and symbolic 
indexing, as well as automated analysis of the logical contents 
of a computation, for example, counting the precise number 
of input vectors satisfying or violating a given property. 
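
Counting satisfying input vectors on a decision diagram can be sketched as follows. The diagram is built naively by Shannon expansion of a Python predicate, which takes exponential time and is only meant for tiny examples; this is not how the Forte toolset constructs BDDs, and the names `build` and `satcount` are hypothetical. Only the counting step reflects the standard technique.

```python
def build(f, n, i=0, env=()):
    """Reduced ordered decision diagram of predicate f over n bits.

    Nodes are tuples (variable index, low child, high child); leaves are
    True or False. A node whose children are equal is dropped.
    """
    if i == n:
        return bool(f(env))
    low = build(f, n, i + 1, env + (0,))
    high = build(f, n, i + 1, env + (1,))
    return low if low == high else (i, low, high)


def satcount(node, n, level=0):
    """Number of assignments to variables level..n-1 that satisfy node."""
    if node is True:
        return 2 ** (n - level)
    if node is False:
        return 0
    index, low, high = node
    skipped = 2 ** (index - level)     # variables dropped above this node are free
    return skipped * (satcount(low, n, index + 1) + satcount(high, n, index + 1))


# How many 4-bit corruption vectors flip at most one bit?
at_most_one = build(lambda bits: sum(bits) <= 1, 4)
print(satcount(at_most_one, 4))
```

The count is obtained from the structure of the diagram alone, without enumerating input vectors again.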

In the Forte/rSTE toolset the base symbolic simulator STE 
is embedded in a code layer called relational STE (rSTE) in 
the context of a full-fledged functional programming language. 
Common computational complexity reduction techniques, in- 
cluding weakening, parametric substitution, etc., are made 
easily accessible to the user through programmable options 
to the tool. The framework also provides sophisticated debug 
support, breakpoints, waveform and circuit visualization, etc., 
to enable users to quickly focus on usual verification problems. 
The full programmability of the tool allows users to write 
reusable verification recipes that automate and structure shared 
or repeated tasks. 

An important aspect of the verification toolset is that it pro- 
vides a general symbolic computation capacity for Booleans. 
Not only can circuits be simulated with symbolic values, but 
any user-written program operating on Boolean data can be 
symbolically computed. This feature is very useful for multiple 
purposes: ad hoc programmatic analysis of failures, breaking 
symbolic computations into parts to analyze complexity issues, 
and early algorithm experiments prior to the existence of 
hardware implementations of those algorithms. 


IV. ECC FORMAL VERIFICATION 


The verification setup for ECC FV involves connecting the 
Writer and the Reader, as shown in Fig. 3, abstracting out 
the storage component which usually sits in between these 
two blocks in real designs and replacing it with a corruption 
model. This model explicitly adds the effect of corruption on
the codeword $CW_w$ generated by the Writer before it is fed
to the Reader.

In the setup described in Fig. 3, there are two inputs, $D_w$
and $C$. For symbolic analysis of the logic, we can assume
these inputs to be symbolic variables instead of fixed streams
of 0s and 1s, representing all values in the input space.
Symbolic simulation then traverses the design, transforming
input variables as BDDs in accordance with the design's logic,
and finally makes the transformed BDDs available at outputs
$D'$ and $Err'$. Correctness is then evaluated as a comparison
between the output BDDs and the input BDDs under specific
assumptions on the corruption.

For an ECC to guarantee correction of up to n bits/bit- 
groups and detection of up to n + 1 bits/bit-groups of corrup- 
tion, the following must hold: 

• Property 1: $(\mathrm{Countbits}(C) = 0) \Rightarrow NE$ and $D' = D_w$

• Property 2: $(0 < \mathrm{Countbits}(C) \le n) \Rightarrow CE$ and $D' = D_w$

• Property 3: $(\mathrm{Countbits}(C) = n + 1) \Rightarrow DUE$ and no
guarantee on $D'$
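
The three properties can be checked by brute force on a toy code, which makes their structure concrete. The sketch below uses a four-fold repetition of a single data bit, which behaves as a SECDED code ($n = 1$). It enumerates all corruption vectors explicitly, whereas the approach in this paper treats them symbolically; the functions `write`, `read`, and `check_properties` are hypothetical.

```python
from itertools import product


def write(d):
    """Toy Writer: store one data bit four times."""
    return [d] * 4


def read(cw):
    """Toy Reader: majority vote, with a tie reported as uncorrectable."""
    ones = sum(cw)
    if ones in (0, 4):
        return cw[0], "NE"
    if ones == 2:
        return 0, "DUE"
    return int(ones > 2), "CE"


def check_properties(write, read, width, n):
    """Check Properties 1 to 3 for every data bit and corruption vector."""
    for d in (0, 1):
        for c in product((0, 1), repeat=width):
            # Corruption model: XOR the corruption vector onto the codeword.
            cw = [bit ^ flip for bit, flip in zip(write(d), c)]
            data, err = read(cw)
            k = sum(c)  # Countbits(C)
            if k == 0 and not (err == "NE" and data == d):
                return False
            if 0 < k <= n and not (err == "CE" and data == d):
                return False
            if k == n + 1 and err != "DUE":
                return False
    return True


print(check_properties(write, read, width=4, n=1))
```

Corruption vectors with more than $n + 1$ set bits are deliberately left unconstrained, matching the statement that the algorithm makes no claims in that case.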

If the number of corrupted bits/bit-groups exceeds $n + 1$,
the algorithm makes no claims. For Single Error Correction
Double Error Detection (SECDED), $n = 1$; for Double Error
Correction Triple Error Detection (DECTED), $n = 2$; and for
TECQED, $n = 3$. DRAM ECCs employ custom algorithms
at the level of devices, groups of bits of size 32 or 64, on a
DIMM (dual inline memory module). The levels of protection
provided by DRAM ECCs include full device protection, half
device protection, and column protection.

In properties 1 to 3, note that the conditions NE, CE, and 
DUE are required to be mutually exclusive and exhaustive. 
Different circuits implement this differently, but regardless it 
is necessary to prove mutual exclusiveness and exhaustivity. 
A circuit may encode 2 bits such that 00 is NE, 01 is CE, and 
10 is DUE. In such a case we need to show that 11 cannot 
be computed. In other cases, each type of error is indicated 
by a separate signal, in which case we need to show that 
these signals are mutually exclusive. Usually, though, circuits 
indicate whether data was corrected, or not, with just one 
signal. If this signal is 1, then it is DUE; if 0, it is CE or NE. 
We need to show that none of these three conditions overlap. 
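
As a concrete illustration of Properties 1 to 3, the following sketch checks them by brute-force enumeration on a toy SECDED (n = 1). The extended Hamming (8,4) code and the names encode, decode and properties_hold are assumptions made for this example; they are not the designs or the Forte flow discussed in the paper, which works symbolically on BDDs rather than by enumeration.

```python
from itertools import product

def encode(data):
    """Toy Writer: 4 data bits -> 8-bit extended Hamming codeword."""
    d0, d1, d2, d3 = data
    p1 = d0 ^ d1 ^ d3                      # covers positions 1, 3, 5, 7
    p2 = d0 ^ d2 ^ d3                      # covers positions 2, 3, 6, 7
    p4 = d1 ^ d2 ^ d3                      # covers positions 4, 5, 6, 7
    word = [p1, p2, d0, p4, d1, d2, d3]    # positions 1..7
    overall = sum(word) % 2                # extra parity bit, position 8
    return word + [overall]

def decode(codeword):
    """Toy Reader: returns (status, data), status in {'NE', 'CE', 'DUE'}."""
    w = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos
    odd_parity = sum(w) % 2
    if syndrome == 0 and not odd_parity:
        status = "NE"
    elif odd_parity:
        if syndrome:                       # flip the located bit
            w[syndrome - 1] ^= 1
        status = "CE"
    else:
        status = "DUE"
    return status, (w[2], w[4], w[5], w[6])

def properties_hold(n=1):
    """Check Properties 1-3 for all data and all corruptions of <= n+1 bits."""
    for data in product((0, 1), repeat=4):
        written = encode(data)
        for c in product((0, 1), repeat=8):
            k = sum(c)
            if k > n + 1:
                continue                   # outside the claimed bound
            status, out = decode([b ^ e for b, e in zip(written, c)])
            if k == 0 and not (status == "NE" and out == data):
                return False
            if 0 < k <= n and not (status == "CE" and out == data):
                return False
            if k == n + 1 and status != "DUE":
                return False
    return True
```

Because this toy decode returns a single status value, mutual exclusiveness and exhaustivity hold by construction; in a real circuit with separate status signals they would have to be proven as described above.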


A. ECC Implementation Verification 


Using the symbolic simulator of the Forte/rSTE toolset, the 
correctness of ECC designs can be ascertained without any 
reference to algorithms or design internals. This gives this 
technique a clear edge over other datapath FV tools which 
usually depend on a high-level model (HLM) against which 
an equivalence check is performed. Such HLMs are them- 
selves prone to error and may incorporate an error which 
is also present in the design, in which circumstance a full 
equivalence check will nonetheless mask the bug. Moreover, 
such HLMs may need frequent remodeling in tandem with 
algorithm changes, which occur on a regular basis in the 
current landscape where ECC algorithms are continuously 
tuned in response to performance and security requirements. 

To understand the nature of this verification process, let 
us take an example SECDED design protecting 4 bits of 
data (D[0]-D[3]) using 4 check-bits. The corruption vector 
(C[0]-C[7]) represents corruption that can happen at any bit 
position of the 8-bit codeword (data and check-bits). After 
symbolic computation of BDDs at each relevant node and 
times of interest, the BDD at the output port ‘NE’, which 
indicates absence of corruption on read data, may look like 
the BDD in Fig. 2 (a). Importantly, this BDD only makes 
reference to corruption bits, although the symbolic simulation 
accounts for fully symbolic data bits. This suggests that the 
symbolic condition for ‘NE’ depends only on corruption bits 
and is independent of the data bits. It can also be noted that 
in this BDD there are several paths that lead to the terminal 
node ‘T’, while the naive expectation would be for a single 
path to reach this terminal i.e., the no-corruption path. This 
is due to the fact that ECC algorithms are constructed to 
guarantee error correction and detection up to a maximum 
bound of corruption, while the corruption vector that we 
considered allows corruption on every bit of codeword i.e., 
up to 8 bits of corruption. Therefore, to verify the algorithm’s 
properties, we must evaluate this BDD under the implication 
of the max-bound condition. Forte provides debug hooks that 
allow users to access the BDDs at different design nodes 
at various times; thus the ‘NE’ BDD can be extracted and 
evaluated for satisfiability using simple Forte commands when 
Countbits(C) ≤ 2. Under this condition, property 1 is 
substantiated and the only satisfiable path is the one where 
C[0]-C[7] are all false. Symbolic analysis can be done in a 
similar fashion on other properties. 

Fig. 2: BDD for NE in Example 

We saw that a simple 4 bit SECDED could result in a 44- 
node BDD in Fig. 2 (a) for an error signal in the circuit. 
Computing and storing BDDs of this kind is a likely limiting 
factor as design complexity increases. By means of various 
techniques described below, we could limit the BDD sizes 
to smaller bounds and scale this technique to designs where 
commercial datapath tools failed to converge. 

1) Parametric Substitution: In many circumstances, sym- 
bolically simulating for a subset of data, i.e., data under 
a specified condition, is more efficient than symbolically 
simulating with unconstrained data. In such circumstances, 
parametric substitution [2] is very effective. A generic cor- 
rectness statement of a design can be represented as: 


P(x) ⇒ Q(x) 


Where P is a constraint on the data space, x is a vector 
of BDD variables, and Q is a function that carries out 
symbolic simulation. Under parametric substitution, we use a 
function param to compute a parametrized functional vector 
representation of P and rewrite the correctness statement as: 


Q(param(P(x))) 


As an example, Fig. 2 (a) depicts the BDD for the No Error 
(NE) signal of the 4-bit SECDED design, when computed in a 
simulation with fully unconstrained values. This BDD captures 
the behavior of the design for any number of corruptions from 
zero to eight. However, the design is only expected to produce 
reasonable output when the number of corrupted bits is at most 
two, in other words when the condition Countbits(C) ≤ 2 
is true. We can compute a parametric substitution from this 
condition, and instead of simulating the system with fully 
unconstrained symbolic corruption bits, we can simulate it 
with small BDDs for the corruption bits, restricting the behavior 
only to the interesting cases. Conceptually, the parametric 
substitution produces BDDs for the corruption bits that allow 
the first two corruption bits to have any values, but any 
subsequent bit can only be high if at most one higher bit is 
already high. In the resulting simulation, the BDD for the No 
Error (NE) signal is as depicted in Fig. 2 (b), a considerable 
simplification when contrasted with the general case. 
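
The conceptual description above can be sketched on concrete values. The function param_at_most below is a hypothetical stand-in for the result of param, written as an ordinary function over bit tuples rather than as BDDs: each corruption bit follows its free parameter bit unless two earlier bits are already high. The check confirms that the image of this mapping is exactly the set of vectors satisfying Countbits(C) ≤ 2, which is what makes it a valid parametrization of the constraint.

```python
from itertools import product

def param_at_most(free_bits, k=2):
    """Map unconstrained parameter bits to a vector with at most k ones."""
    vector, used = [], 0
    for bit in free_bits:
        out = bit if used < k else 0   # later bits are forced low
        used += out
        vector.append(out)
    return tuple(vector)

def reachable_vectors(nbits=8, k=2):
    """All corruption vectors produced by some choice of parameter bits."""
    return {param_at_most(v, k) for v in product((0, 1), repeat=nbits)}

def allowed_vectors(nbits=8, k=2):
    """All corruption vectors that satisfy Countbits(C) <= k."""
    return {c for c in product((0, 1), repeat=nbits) if sum(c) <= k}
```

For the 8-bit codeword of the example, both sets contain the same 37 vectors (1 with no corruption, 8 with one corrupted bit, 28 with two).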


2) Case-Splitting: With case-splitting, we decompose the 
data space into a number of sets and separately verify the 
circuit for each set. This reduces the BDD complexity and 
search space for each case in a divide-and-conquer fashion. 
ECCs naturally lend themselves to a case-split on the number 
of bits of corruption that are allowed. For example, a SECDED 
design can be decomposed into 3 cases: no corruption, 1b 
corruption, and 2b corruption. Parametric substitution of the 
case constraint will lead to even smaller BDDs. In the case of 
the example illustrated in Fig. 2, it will lead to a zero-sized 
BDD with only a terminal vertex “True” or “False” for the 
signal ‘NE’. 

Further case-splitting can be done based on the locations 
of the (one or more) corruption bits. This has been essential 
in our verification of 512-bit TECQED designs, as it made 
convergence of the proof possible. In addition, case-splitting 
is useful towards reducing the runtimes of existing proofs by 
means of parallel processing. 
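
A minimal sketch of such a case-split, assuming a split first on the number of corrupted bits and then on their locations. The helper names are hypothetical; in the actual flow each case would constrain a symbolic simulation run, whereas here each case is simply expanded to the concrete corruption vector it stands for, and the cases are checked to cover the constrained space exactly once.

```python
from itertools import combinations, product

def case_splits(nbits, max_errors):
    """Yield one case per (error count, error locations) pair."""
    for count in range(max_errors + 1):
        for positions in combinations(range(nbits), count):
            yield count, positions

def corruption_vector(nbits, positions):
    """Concrete corruption vector with ones at the given positions."""
    return tuple(1 if i in positions else 0 for i in range(nbits))

def covers_constraint(nbits, max_errors):
    """True if the cases hit every vector with <= max_errors ones exactly once."""
    vectors = [corruption_vector(nbits, pos)
               for _, pos in case_splits(nbits, max_errors)]
    expected = {c for c in product((0, 1), repeat=nbits)
                if sum(c) <= max_errors}
    return len(vectors) == len(set(vectors)) and set(vectors) == expected
```

Since the cases are independent of one another, a list such as list(case_splits(8, 2)) could be handed to separate worker processes, which is the parallelism referred to above.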


3) Symbolic Indexing: Symbolic Indexing [1] is an efficient 
technique that can logarithmically scale down the number 
of variables a BDD is dependent on. Taking the example 
of 4-bit-SECDED, if we replace the 8-bit corruption vector 
(C[0]-C[7]) with two vectors (CI1[0]-CI1[2]) and (CI2[0]- 
CI2[2]), where the value CI1 gives the index of the first bit 
that is corrupted and CI2 gives the index of the second 
corrupted bit, then the same symbolic corruption information 
can be relayed to the simulator using 6 variables instead of 
the original 8. Generally speaking, a symbolic corruption on 
an ECC design with codeword length n and up to k bits of 
corruption can be represented using k × log2(n) variables using symbolic 
indexing, which would otherwise require n variables. This 
state space reduction becomes all the more important as we 
move to larger designs such as 4096-bit-SECDED, where this 
technique allows use of two 13-bit corruption-index vectors 
instead of a 4110-bit corruption vector. 
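
The variable counts quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: the helper names are made up for illustration, the index width is rounded up to a whole number of bits, and expand_indices works on concrete index values rather than symbolic ones.

```python
def index_width(codeword_len):
    """Bits needed to index any position in 0..codeword_len-1."""
    return max(1, (codeword_len - 1).bit_length())

def indexed_variable_count(codeword_len, max_errors):
    """Variables needed when each corrupted bit is named by an index."""
    return max_errors * index_width(codeword_len)

def expand_indices(indices, codeword_len):
    """Turn index values into the corruption vector they stand for."""
    vector = [0] * codeword_len
    for i in indices:
        vector[i] = 1
    return vector
```

Here indexed_variable_count(8, 2) gives 6 and indexed_variable_count(4110, 2) gives 26, i.e., two 13-bit index vectors. Note that in this simple encoding two equal indices denote a single corrupted bit, and the no-corruption case needs separate treatment.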


4) Variable Ordering: BDD size is very sensitive to its 
variable order [7]. The variable order of a BDD determines the 
order in which variables appear along all of its node-traversal 
paths. A good variable order is required to ease BDD 
computations on bigger circuits like memory controllers where 
one design may support multiple ECC schemes. In verifica- 
tion of such designs, it is advisable to put control variables 
before data variables. This is because the control variables 
may choose a completely different mode of operation in the 
circuit; and having them at the top of the BDD tree simplifies 
the branches by preventing a commingling of different ECC 
schemes. For example, variables on signals that select the ECC 
mode, or signals that are used for configuration settings such 
as error masking, should take precedence in ordering relative 
to variables for corruption and data. 
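
The sensitivity of BDD size to variable order can be seen on a textbook function that is unrelated to any ECC design: (a1 ∧ b1) ∨ (a2 ∧ b2) ∨ (a3 ∧ b3). The sketch below counts the internal nodes of the reduced ordered BDD by enumerating distinct subfunctions of the truth table; robdd_size is a hypothetical helper written for this illustration and is not part of Forte.

```python
from itertools import product

def robdd_size(f, order):
    """Count internal (non-terminal) ROBDD nodes of f under `order`."""
    table = tuple(int(bool(f(dict(zip(order, bits)))))
                  for bits in product((0, 1), repeat=len(order)))
    nodes, seen, stack = set(), set(), [table]
    while stack:
        t = stack.pop()
        if len(t) == 1 or t in seen:
            continue
        seen.add(t)
        half = len(t) // 2
        low, high = t[:half], t[half:]   # top variable = 0 / 1
        if low != high:                  # function depends on the top variable
            nodes.add(t)
            stack.append(high)
        stack.append(low)                # if low == high the variable is skipped
    return len(nodes)

def f(env):
    return ((env["a1"] and env["b1"]) or (env["a2"] and env["b2"])
            or (env["a3"] and env["b3"]))

interleaved = ["a1", "b1", "a2", "b2", "a3", "b3"]
separated = ["a1", "a2", "a3", "b1", "b2", "b3"]
```

With the interleaved order the BDD has 6 internal nodes, while the separated order needs 14, and the gap grows exponentially with the number of pairs. Placing mode-select and configuration variables first follows the same intuition: related decisions are resolved together instead of being remembered across many levels.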


5) Dynamic Weakening: Symbolic simulation on ECC designs 
may sometimes encounter a BDD blow-up. Forte assists 
in investigating and resolving such a bottleneck through 
dynamic weakening. The user can provide a maximum BDD size, 
and whenever the BDD size at an internal node 
during the symbolic simulation exceeds the provided limit, the 
tool automatically ‘weakens’ that node, i.e., replaces that BDD 
with an ‘X’. This new value is then propagated through the 
circuit simulation. If the weakened node was irrelevant to the 
final output computation, then it saves unnecessary simulation 
on that path, else the X propagation reaches the output nodes. 
In these cases, the BDD representation at the output node can 
be of the form BDD_A + X(BDD_B), where BDD_A represents 
the variable assignments that give concrete values 1 and 0 to 
the output and BDD_B represents the variable assignments 
that can lead to X. Forte’s schematic viewer enables chasing 
this X and determining the cause of the divergence. The tool 
also facilitates substitution of variables with random example 
values. This makes sample cases more concrete and easier to 
debug. 


Fig. 4: Architectural ECC FV 
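
The effect of weakening a node can be sketched with scalar three-valued logic. This is a simplification made for illustration: values are 0, 1 or X instead of BDDs, the "BDD size" is just a number passed in by the caller, and the function names and the small circuit are hypothetical rather than taken from Forte.

```python
X = "X"   # unknown value produced by weakening

def t_not(a):
    return X if a == X else 1 - a

def t_and(a, b):
    if a == 0 or b == 0:
        return 0              # a controlling 0 masks an X
    if a == 1 and b == 1:
        return 1
    return X

def t_or(a, b):
    return t_not(t_and(t_not(a), t_not(b)))

def weaken_if_large(value, bdd_size, limit):
    """Replace a node's value by X when its (assumed) BDD size exceeds limit."""
    return X if bdd_size > limit else value

def output(enable, big_node, other, bdd_size, limit):
    """Toy circuit: out = (enable AND big_node) OR other."""
    node = weaken_if_large(big_node, bdd_size, limit)
    return t_or(t_and(enable, node), other)
```

With enable = 0 the weakened node is irrelevant and the output stays concrete; with enable = 1 and other = 0 the X reaches the output, which corresponds to the situation in which the user has to chase the X back to its source.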


B. ECC Architectural Verification 


Forte can also be used to check algorithm architecture, in ad- 
dition to its use in closed-box verification of design properties. 
This mode, however, does need algorithm understanding and 
modeling the Writer and Reader parts of algorithms as HLMs, 
but the goal remains the same i.e., checking overall correctness 
of algorithm by means of property checking. This is done by 
using a verification setup similar to the design verification, 
only replacing the Writer and Reader design blocks, as shown 
in Fig. 4, with their HLMs written in Forte’s functional 
language reFLect [9]. Verification tasks of this nature, instead 
of using the symbolic simulation capability of Forte, use its 
symbolic computation feature. In a manner akin to abstract 
interpretation [6], the input variables are propagated through 
the logical functions present in the HLM, undergoing BDD 
transformations at each function. Finally, BDDs are derived at 
the output of the HLM, which can then be used for reasoning 
about the correction and detection properties of the ECC 
algorithm. This architectural verification is independent of 
the design, and in practice it is often carried out before the 
algorithm is implemented in RTL. This shortens the feedback 
loop of design and verification, thus reducing time to market 
for such designs. 


C. Counting and Enumerating Error Patterns 


In modern server designs, some algorithms provide protec- 
tion of a bit-group within specific published bounds. Design 
pressures to add metadata bits to the bit group lead to 
customizations which reduce the number of check-bits and 
result in such a lower bound being chosen over a guarantee 
of full correction. For example, a customization to include 
directory bits, poison bits, and tag bits (i.e., metadata) may 
lead to an algorithm which claims, “100% detection, and better 
than 99.999% correction.” This claim implies that fewer than 
0.001% of all possible block corruption patterns can lead to a 
DUE. This performance-accuracy tradeoff makes verification 
of this claim complex. In contrast to the properties explained 
earlier in this section, which were of the nature “under the 
given conditions, BDDs on the outputs that indicate error 
must evaluate to True or False”, our claim now involves 
exact counting of the paths that lead to the terminal vertices. 
Additionally, an algorithm may make a conditional claim such 
as “Errors that fall on both right and left half of device are 
outside scope of ECC and are not corrected but detected for 
~99.999% of error patterns.” Such a claim, in general, relates 
several design-outputs under a specific corruption condition. 
This claim bounds the number of memory failures that can go 
undetected, also known as SDCs (Silent Data Corruptions). 
To verify this, we need to count all corruption variable 
assignments under which the output BDD for the DUE error 
signal evaluates to False, but the output data D' is not equal 
to the write data D_w. Thus, the property to be checked becomes: 


satCount(Cond ⇒ ¬DUE & (D' ≠ D_w)) < x 


Here, Cond is the corruption condition under which count- 
ing is performed, x is an upper bound on the number of 
expected SDCs, and satCount is a count of the number of 
satisfying assignments to a given formula. 
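
To make the counting concrete, the sketch below enumerates SDC patterns for a toy extended Hamming (8,4) SECDED. Everything here is an assumption for illustration: the toy encoder and decoder are not one of the verified designs, the count is obtained by enumeration rather than from a BDD, and counting is restricted to the assignments that satisfy Cond, taken here to be "exactly num_errors corrupted bits".

```python
from itertools import combinations, product

def encode(data):
    """Toy Writer: 4 data bits -> 8-bit extended Hamming codeword."""
    d0, d1, d2, d3 = data
    word = [d0 ^ d1 ^ d3, d0 ^ d2 ^ d3, d0, d1 ^ d2 ^ d3, d1, d2, d3]
    return word + [sum(word) % 2]          # append overall parity bit

def decode(codeword):
    """Toy Reader: returns (status, data), status in {'NE', 'CE', 'DUE'}."""
    w = list(codeword)
    syndrome = 0
    for pos in range(1, 8):
        if w[pos - 1]:
            syndrome ^= pos
    odd_parity = sum(w) % 2
    if syndrome == 0 and not odd_parity:
        status = "NE"
    elif odd_parity:
        if syndrome:
            w[syndrome - 1] ^= 1           # attempt single-bit correction
        status = "CE"
    else:
        status = "DUE"
    return status, (w[2], w[4], w[5], w[6])

def sdc_patterns(num_errors, nbits=8):
    """Corruption patterns (Cond: exactly num_errors flipped bits) for which
    some data word is returned wrong without DUE being raised."""
    bad = []
    for positions in combinations(range(nbits), num_errors):
        for data in product((0, 1), repeat=4):
            cw = encode(data)
            for p in positions:
                cw[p] ^= 1
            status, out = decode(cw)
            if status != "DUE" and out != data:
                bad.append(positions)
                break
    return bad
```

For this toy code there are no SDC patterns with one or two corrupted bits, whereas all 56 three-bit patterns lead to SDC, consistent with a SECDED making no claims beyond n + 1 = 2 corrupted bits. The returned list also serves as the enumeration of failing patterns.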

The corruption condition and the DUE/SDC conditions can 
be composed together to form a new BDD, and we can 
count the number of satisfying instances through procedures 
written in reFLect. We are also able to enumerate the corruption 
patterns that lead to SDC or DUE in addition to counting them. 
This data is sometimes needed by memory vendors and is also 
helpful during debugging to understand the frequency/location 
of failures. 

One consideration while generating these counts is the 
avoidance of duplicates, which we illustrate for the example 
of a SECDED algorithm. To count the SDC cases for 3 bit 
corruptions, we define symbolic indices p1, p2 and p3. Once 
we compute the SDC condition, there could be cases that are 
counted multiple times, such as p1 = 0, p2 = 1, p3 = 2 and 
p1 = 0, p2 = 2, p3 = 1. However, by assuming without loss 
of generality that p1 > p2 > p3 in the condition in the above 
expression, the counting of duplicate cases is avoided. 
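
The effect of the ordering assumption on the count can be checked directly for the 8-bit codeword of the earlier example. The function below is a hypothetical helper that counts assignments to k index variables pointing at distinct bits, with or without the constraint p1 > p2 > ... > pk.

```python
from itertools import product

def count_index_assignments(nbits, k, ordered):
    """Count assignments of k indices to distinct bit positions."""
    count = 0
    for p in product(range(nbits), repeat=k):
        if len(set(p)) != k:
            continue                      # indices must name distinct bits
        if ordered and any(p[i] <= p[i + 1] for i in range(k - 1)):
            continue                      # keep only p1 > p2 > ... > pk
        count += 1
    return count
```

Without the ordering there are 8 × 7 × 6 = 336 assignments for three indices, each corruption pattern being counted 3! = 6 times; with the ordering there are 56, one per distinct 3-bit corruption pattern.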


V. RESULTS 


We discuss the impact seen from this verification effort 
on ECC designs of varying complexity. In the past 2 years, 
we have verified 14 ECC designs and their corresponding 
algorithms, resulting in the discovery of 48 bugs overall and 
proving the absence of bugs in customer releases. These ECCs 
are the state of the art for commercial designs. They represent 
a full range of Intel designs and were not cherry-picked for 
the case study. 

Quantitatively, Table I lays out the results of ECC FV 
spanning multiple projects and design generations. Table I 
compares ECC property checking using Forte against established 
industrial EDA (Electronic Design Automation) tools 
tuned for control-path and data-path FV. Since our BDD-based 
technique with Forte allows us to do closed-box 
checking without reference to design internals, we explored 
the feasibility of similar testing with the EDA tools for a 
fair comparison. Tool #1 and Tool #2 in Table I can use the 
same verification setup as shown in Fig. 3 and allow the user 
to state the design properties by means of SystemVerilog 
Assertions (SVA). Both these tools use various engines that 
can run in parallel to achieve a concrete result and may give a 
bounded proof if they fail to converge. As seen from 
Table I, these tools are able to converge on small-sized designs 
based on simple ECC algorithms such as SECDED, but as the 
design size or algorithm complexity increases, convergence is 
not seen. Our techniques, however, achieve convergence in a 
matter of minutes in all of the designs under consideration. 
Tool #2 is more tuned towards datapath verification, but no 
difference was observed between Tool #1 and Tool #2 with 
respect to convergence on these tasks. Typically, datapath FV 
commercial tools do better on arithmetic designs than standard 
model checkers due to their word-level engines. However, the 
arithmetic in ECC algorithms is primarily bit-level and, as seen 
from our results, word-level processing was not particularly 
useful here. 

The size of ECC designs ranged from 3K gates (smallest) 
to over a million gates (largest). However, more than the 
design size, proof convergence depended on the arithmetic 
complexity of the algorithm itself. For example, algorithms 
offering bit protection were more amenable to FV proofs compared 
to algorithms doing bit-group level protection. Also, the 
complexity increased as the number of bits under the protection 
umbrella grew. For instance, the number of case-splits required 
to achieve proof convergence was 17K for a 512-bit TECQED 
and only 300 for a DECTED design of the same data-width, 
while none of the SECDED designs verified needed a case- 
split. Within the same algorithm category, the complexity was 
directly proportional to the data-size. So, a 32 bit SECDED is 
much easier to verify compared to a 4096 bit SECDED. 

Qualitatively, we consider it instructive to categorize the 
kinds of bugs we have found. This analysis is intended to help 
both design experts and verification experts identify common 
patterns that lead to design errors. 


A. Architectural Bugs 


Architectural FV allows early bug investigation, even before 
the implementation of an algorithm in RTL. As a result, bugs 
found in this process are prevented from ever entering the 
RTL design. This is a worthwhile exercise since the algo- 
rithms themselves are complex enough, owing to the interplay 
between different architectural features, to give rise to corner 
case bugs. For example, our recent investigation of single 
block corruption in a new ECC scheme in a memory controller 
found exactly 3 failure cases out of 18 x 23%. Previously, 
some of our FV investigations have found corner case bugs 
that escaped testing and subsequently led to the publication of 
customer errata [11]. 


                                              | Protection level           | Data width | Engineering effort                        | Property Convergence 
Algorithm                                     |                            | in bits    | in person-days                            | Forte | EDA tool #1 | EDA tool #2 
SECDED                                        | Bit                        | 1-256      | <2                                        | Yes   | Yes         | Yes 
SECDED                                        | Bit                        | 4096       | <2                                        | Yes   | No          | No 
DECTED                                        | Bit                        | 256        | <4                                        | Yes   | No          | No 
DECTED                                        | Bit                        | 512        | <4                                        | Yes   | No          | No 
TECQED                                        | Bit                        | 512        | <15                                       | Yes   | No          | No 
Custom ECC schemes for DRAM device protection | Bit groups (16/32/64 bits) | 512        | Continuous engagement across design cycle | Yes   | No          | No 

TABLE I: Comparison of Property Checking with Different Formal Tools. EDA Tool #1 is a Model Checking Tool and EDA 
Tool #2 is a Commercial Datapath FV Tool 


Fig. 5: Implementation Error Example 


B. Implementation Errors 


Even with a correct algorithm, an implementation can be 
erroneous, due to a variety of reasons such as specification 
ambiguity. We encountered one such bug while reading parity 
from memory; while the architecture specified a column major 
order read, the RTL implementation was row major. In another 
example, a simple misconnection led to a breakdown of ECC 
functionality. This case is illustrated in Fig. 5 which shows 
the functionality of a generic ECC Reader. The Reader reads 
the codeword from memory, which is comprised of data D_r 
and check-bits (i.e., locator bits L_r and parity bits P_r). The 
Reader uses the read data D_r and the locator bits L_r to 
recalculate the new check-bits (L'_r and P'_r). These recalculated 
values are then compared against the check-bits that were read 
from memory to compute syndromes that are then used to 
ascertain error presence and its correction. However, in the 
case presented in Fig. 5, instead of using the original locator 
bits L_r (green arrow in Fig. 5), the recalculated 
version L'_r was used (red arrow in Fig. 5) to re-compute the 
parity bits. Due to this seemingly innocuous issue, 60% of 1b 
corruption cases that specification claimed to be correctable 
were marked uncorrectable in the design, and around 25% of 
2b corruption cases led to fatal SDCs. The timely verification 
of these designs prevented these critical bugs from making 
their way into the final products. 
Fig. 6: Pipeline Bug Example 


C. Pipeline Bugs 


Frequently, bugs arise from pipelines where a signal was 
used at the wrong stage, or an incorrect clock-enable prevented 
the relevant values from propagating. One such failure is 
described in Fig. 6. Here 2 sets of data (Ar and B) enter 
the Reader in succession, where Ar has 1b corruption, and 
B is not corrupted. The design was expected to correct the 
corrupted Ar to its original value A and to leave B unchanged. 
However, B was changed. It was found that an internal signal, 
correctionMask, used for fixing the corruption was not updated 
while processing B due to an incorrect clock-enable, and its 
stale value resulted in a spurious correction. This behavior 
continued for a long time in the pipeline, until the next update 
of the clock enable signal. This shows that an algorithm, 
however carefully designed, can be rendered ineffective for 
a large number of corruption cases due to pipeline bugs. The 
fixing of this bug also shows the salutary effect of datapath 
FV on the surrounding control-path logic, as the closed-box 
verification approach focuses on the overall functioning of the 
design in addition to the correctness of the ECC algorithm. 


D. Specification Bugs 


The RAS capabilities of an ECC design need to be clearly 
documented for customers in an External Design Specification 
document. Thus, these specifications need to be accurate and 
must reflect exact ECC capabilities that exist in the silicon 
product. Many of the complex algorithms may not provide 
100% correction on a block, but nonetheless specify x% 
correction, y% detection, and z% silent data corruption. These 
data percentages are critical to memory vendors and need to 
be verified, but this verification is complex as it is not a simple 
true or false claim but involves exact counting of each category 
of results. Since the number of satisfying assignments can be 
counted using symbolic representations, it can be verified that 
both the ECC algorithms and their implementations deliver the 
claims that they make in the specification. We helped in fixing 
some of these results based on our calculations. In one such 
case, an anomaly was detected on the number of DUE counts 
where the actual counts offered by the algorithm differed from 
the published claims by just 7.10e-13%. 


E. Miscellaneous Bugs 


Since we analyze each ECC design in depth, we sometimes 
encounter issues such as efficiency bugs, where the design 
uses more check-bits than required by the algorithm, or 
parametrization bugs, where some design parameters are not 
passed down correctly in the design. 


VI. RELATED WORK 


Model-checking based FV techniques have been used for 
verifying ECC designs. For example, a 128-bit TECQED ECC 
was formally verified in [13], and a 256-bit Double Error 
Correction Triple Error Detect (DECTED) ECC design was 
formally verified in [8] using a commercial model checker. 
Both these proofs converged only after a lot of design in- 
terventions and rewriting the design to make the logic fully 
combinational. These interventions need special handling, and 
one needs to make sure the bridges between these abstract 
models are verified, maintaining overall coherence. In contrast, 
our approach does not need any reduction or abstraction of 
designs. Scaling up these approaches [8], [13] to bigger ECC 
designs will be difficult, as model-checking tools struggle with 
the inherent complexity of ECC designs and the vast 
input space. In [13], extreme convergence steps were taken 
to conclude the proof on a 128-bit TECQED with a proof 
runtime that is counted in days, while with our technique we 
could verify a 4x data-width design (512-bit TECQED) in 
just 2 hours. 

Lvov et al. [14] verified Reed-Solomon codes by computing 
Gröbner bases, using the SINGULAR arithmetic engine. Their 
proofs are independent of data width and their runtimes are 
dependent only on the number of bits corrupted. However, 
their assumption of the insufficiency of BDD-based techniques 
for ECC verification has not been borne out in Forte, which 
is capable of crunching through Boolean equations of the 
required size. This is accomplished through variable ordering 
and parametric substitution techniques, as discussed further in 
section III. As a result, ECC verification in Forte becomes a 
much simpler matter of declarative specification of the desired 
ECC properties, without reference to the underlying algebraic 
structure. 


VII. CONCLUSION 


The results discussed in this paper show the efficacy of 
our BDD-based symbolic representation in verifying properties 
of ECC designs at both the algorithmic and RTL level, 
finding bugs which would have been infeasible to find through 
testing. These techniques are scalable to large ECCs by means 
of parametric substitution and other complexity management 
techniques. The success of these techniques in discovering 
bugs on industrial designs allows the categorization of the 
most common kinds of ECC bugs, which in turn shapes the 
practice of ECC design towards avoiding these bugs from the 
very beginning. 

These techniques are valuable because they allow for a 
closed-box approach that requires neither knowledge of the de- 
sign nor an HLM for equivalence checking. Additionally, these 
Forte techniques outperform other closed-box tools. Forte 
differentiates itself here by allowing algorithmic verification, 
even in advance of the RTL being written, and by helping 
provide bounds on the incidence of certain kinds of errors. 
By facilitating efficient correctness proofs and supporting the 
development and tuning of ECC designs on multiple fronts, 
Forte-based ECC verification techniques position themselves 
to be useful well into the future. 
Abstract—Automated reasoners, such as SAT/SMT solvers and 
first-order provers, are becoming the backbones of applications of 
formal methods, for example in automating deductive verification, 
program synthesis, and security analysis. Automation in these 
formal methods domains crucially depends on the efficiency 
of the underlying reasoners towards finding proofs and/or 
counterexamples of the task to be enforced. In order to gain 
efficiency, automated reasoners use dedicated proof rules to keep 
proof search tractable. To this end, subsumption is one of the 
most important proof rules used by automated reasoners, ranging 
from SAT solvers to first-order theorem provers and beyond. It is 
common that millions of subsumption checks are performed 
during proof search, necessitating efficient implementations. 
However, in contrast to propositional subsumption as used by 
SAT solvers and implemented using sophisticated polynomial 
algorithms, first-order subsumption in first-order theorem provers 
involves NP-complete search queries, turning the efficient use of 
first-order subsumption into a huge practical burden. In this 
paper we argue that integration of a dedicated SAT solver 
provides a remedy towards efficient implementation of first- 
order subsumption and related rules, and thus further increasing 
scalability of first-order theorem proving towards applications of 
formal methods. Our experimental results demonstrate that, by 
using a tailored SAT solver within first-order reasoning, we gain 
a large speed-up in state-of-the-art benchmarks. 

Index Terms—first-order subsumption, multi-literal matching, 
automated theorem proving, satisfiability checking 


I. INTRODUCTION 


Most formal verification approaches use automated reasoners 
in their backend to, for example, discharge verification condi- 
tions [22], [10], [15], produce/block counter-examples [20], 
[29], [1], or enforce security and privacy properties [30], 
[25], [4], [32]. All these approaches crucially depend on the 
efficiency of the underlying reasoning procedures, ranging from 
SAT/SMT solving [6], [12], [3] to first-order proving [41], [21], 
[34], [11]. In this paper we focus on automated first-order 
theorem proving with the aim of improving efficiency towards 
proving first-order (program) properties. 

The leading concept behind the proof-search algorithms 
used by state-of-the-art first-order theorem provers is satura- 
tion [34], [21]. While the concept of saturation is relatively 
unknown outside of the theorem proving community, similar 
algorithms that are used in other areas, such as Gröbner basis 
computation [9], can be considered examples of saturation 
algorithms. The key idea behind saturation-based proof search 
is to reduce the problem of proving validity of a first-order 
formula A to the problem of establishing unsatisfiability of ¬A 
by using a sound inference system, most commonly using 
the superposition inference system [28]. That is, instead of 
proving A, we refute ¬A, by selecting and applying inferences 
from the superposition calculus. In this paper, we focus on 
saturation algorithms using the superposition calculus. 

Saturation with Redundancy. During saturation, the first-order 
prover keeps a set of usable clauses C_1, ..., C_k, with 
k > 0. This is the set of clauses that the prover considers as 
possible premises for inferences. After applying an inference 
with one or more usable clauses as premises, the consequence 
C_{k+1} is added to the set of usable clauses. The number of 
usable clauses is an important factor for the efficiency of proof 
search. A naive saturation algorithm that keeps all derived 
clauses in the usable set would not scale in practice. One 
reason is that first-order formulas in general yield infinitely 
many consequences. For example, consider the clause 


¬positive(x) ∨ positive(reverse(x)),    (1) 


where x is a universally quantified variable ranging over the 
algebraic datatype list, where list elements are integers; 
positive is a unary predicate over list such that positive(x) is 
valid iff all elements of x are non-negative integers; and reverse 
is a unary function symbol reversing a list. As such, clause (1) 
asserts that the reverse of a list x of non-negative integers is 
also a list of non-negative integers (which is clearly valid). 
Note that, when having clause (1) as a usable clause during 
proof search, the clause ¬positive(x) ∨ positive(reverse^n(x)) 
can be derived for any n > 1 from clause (1). Adding 
¬positive(x) ∨ positive(reverse^n(x)) to the set of usable clauses 
would however blow up the search space unnecessarily. This 
is because ¬positive(x) ∨ positive(reverse^n(x)) is a logical 
consequence of clause (1), and hence, if a formula A can 
be proved using ¬positive(x) ∨ positive(reverse^n(x)), then A 
is also provable using clause (1). Yet, storing ¬positive(x) ∨ 
positive(reverse^n(x)) as usable formulas is highly inefficient 
as n can be arbitrarily large. 

To avoid such and similar cases of unnecessarily increasing 
the set of usable formulas during proof search, first-order 
theorem provers implement the notion of redundancy [31], by 
extending the standard superposition calculus with term/clause 
ordering and literal selection functions. These orderings and 
selection functions are used to eliminate so-called redundant 
clauses from the search space, where redundant clauses are 
logical consequences of smaller clauses w.r.t. the considered 
ordering. In our example above, the clause ¬positive(x) ∨ 
positive(reverse^n(x)) would be a redundant clause as it is a 
logical consequence of clause (1), with clause (1) being smaller (i.e., 
using fewer symbols) than ¬positive(x) ∨ positive(reverse^n(x)). 
As such, if clause (1) is already a usable clause, saturation 
algorithms implementing redundancy should ideally not store 
¬positive(x) ∨ positive(reverse^n(x)) as usable clauses. To detect 
and reason about redundant clauses, saturation algorithms with 
redundancy extend the superposition inference system with 
so-called simplification rules. Simplification rules do not add 
new formulas to the set of (usable) clauses in the search space, 
but instead simplify and/or delete redundant formulas from the 
search space, without destroying the refutational completeness 
of superposition: if a formula A is valid, then ¬A can be refuted 
using the superposition calculus extended with simplification 
rules. In our example above, this means that if ¬A can be 
refuted using ¬positive(x) ∨ positive(reverse^n(x)), then ¬A can 
be refuted in the superposition calculus extended with 
simplification rules, without using ¬positive(x) ∨ positive(reverse^n(x)) 
but using clause (1) instead. 

Ensuring that simplification rules are applied efficiently for
eliminating redundant clauses is, however, not trivial. In this
paper, we show that SAT-based approaches can be used to
identify the application of simplification rules during saturation,
thereby improving the efficiency of saturation algorithms
implementing the superposition calculus extended with simplification
rules, as discussed next.

Subsumption for Effective Saturation. While redundancy
is a powerful criterion for keeping the set of clauses used 
in proof search as small as possible, establishing whether 
an arbitrary first-order formula is redundant is as hard as 
proving whether it is valid. For example, in order to derive 
that ¬positive(x) ∨ positive(reverse^n(x)) is redundant in our
example above, the prover should establish (among other
conditions) that it is a logical consequence of (1), which
essentially requires a proof by superposition. To reduce
the burden of proving redundancy, first-order provers implement 
sufficient conditions towards deriving redundancy, so that 
these conditions can be efficiently checked (ideally using only 
syntactic arguments, and no proofs). One such condition comes 
with the notion of subsumption, yielding one of the most 
impactful simplification rules in superposition-based theorem 
proving [2]. 

The intuition behind subsumption is that a (potentially
large) instance of a clause C does not convey any additional
information over C, and thus we should avoid having
both C and its instance in the set of usable clauses; to this
end, we say that the instance of C is subsumed by C. More
formally, a clause C subsumes another clause D if there is a
substitution σ such that σ(C) is a submultiset of D, where we
consider a clause as the multiset of its literals. In such
a case, subsumption removes the subsumed clause D from
the clause set. To continue our example above, a unit clause
positive(reverse^m(x)), with m ≥ 1, would prevent us from
deriving ¬positive(x) ∨ positive(reverse^n(x)) for any n ≥ m,
and hence eliminate an infinite branch of clause derivations
from the search space.

To detect possible inferences of subsumption and related 
rules, state-of-the-art provers use a two-step approach [35]: 
(i) retrieve a small set of candidate clauses, using literal filtering 
methods, and then (ii) check whether any of the candidate 
clauses represents an actual instance of the rule. Step (i) has 
been well-researched over the years, leading to highly efficient 
indexing solutions [27], [33], [35]. Interestingly, step (ii) has not 
received much attention, even though it is known that checking 
subsumption relations between multi-literal clauses is an NP- 
complete problem [19]. Although indexing in step (i) allows the 
first-order prover to skip step (ii) in many cases, the application 
of (ii) in the remaining cases may remain problematic (due 
to NP-hardness). For example, while profiling subsumption in 
the world-leading theorem prover VAMPIRE [21], we observed 
subsumption applications, and in particular calls to the literal- 
matching algorithm of step (ii), that consume more than 20 
seconds of running time. Given that millions of such matchings 
are performed during a typical first-order proof attempt, we 
consider such cases highly inefficient, calling for improved 
solutions towards step (ii). In this paper we address this demand 
and show that a tailored SAT-based encoding can significantly 
improve the literal matching, and thus subsumption, in first- 
order theorem proving. 

Our Contributions. In this paper, we bring the following main 
contributions. 


(1) We propose a SAT-based encoding for capturing potential 
applications of subsumption in first-order theorem proving 
(Section III). A solution to our SAT-based encoding gives a
concrete application of subsumption, allowing the first-order 
prover to apply that instance of subsumption as a simplification 
rule during saturation. Our encoding uses so-called substitution 
constraints to formalize matching of literals within the premises 
(i.e., subset relation among literals of premises). Our encoding 
can be extended to other simplification rules, in particular when 
applying simplifications using the combination of subsumption 
with binary resolution (i.e., subsumption resolution). 


(2) We introduce a lean SAT solving approach tailored to 
substitution constraints, by adjusting unit propagation and 
conflict resolution towards efficient handling of such constraints
(Section IV). We introduce a tailored encoding of substitution 
constraints in SAT solving, advocating the direct use of our 
SAT solver for deciding application of subsumption within 
first-order proving. 


(3) We implemented our SAT-based subsumption approach as 
a new SAT solver in the VAMPIRE theorem prover (Section V). 
We empirically evaluate our approach on the standard benchmark
library TPTP (Section VI). Our experiments demonstrate
that using SAT solving for deciding and applying subsumption 
brings clear improvements in the saturation process of first- 
order proving, for example improving the (time) performance 
of the prover by a factor of 2. 


II. PRELIMINARIES


Let 𝒱 denote a countably infinite set of first-order variables.
We consider standard multi-sorted first-order logic with
variables 𝒱, and support all standard boolean connectives (see
later) and quantifiers in the language. Throughout the paper,
we write x, y, z for first-order variables, c, d for constants,
f, g for function symbols, and p, q for predicates. The set
of first-order terms 𝒯 consists of variables, constants, and
function symbols applied to other terms; we denote terms by t.
First-order atoms, or simply just atoms, are predicates applied 
to terms. Atoms and negated atoms are also called first-order 
literals, and denoted by L, M. First-order clauses, or simply 
just clauses, are disjunctions of literals, denoted by C, D. All 
our notation throughout this paper may possibly use indices. 
A clause that consists of a single literal is called a unit clause. 
Clauses are often viewed as multisets of literals; that is, a
clause C being L_1 ∨ L_2 ∨ ... ∨ L_n is considered to be the
multiset {L_1, L_2, ..., L_n}. For example, the clause p ∨ ¬q ∨ p
is the multiset {p, ¬q, p}.

An expression E is a term, literal, or clause. We denote
the set of variables occurring in the expression E by 𝒱(E).
A substitution is a function σ: 𝒱 → 𝒯 such that σ(x) ≠ x
only for finitely many x ∈ 𝒱. The function σ is extended
to arbitrary expressions E by simultaneously replacing each
variable x in E by σ(x). We say an expression E_1 can be
matched to expression E_2 if there exists a substitution σ such
that σ(E_1) = E_2.
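
The notion of matching can be illustrated with a short Python sketch. The term representation (variables as strings, applications as tuples) and the names match and substitute are assumptions made for this illustration only; they are not notation or code from the paper.

```python
# Illustrative term representation (an assumption of this sketch):
#   variable          -> str, e.g. "x"
#   f(t_1, ..., t_k)  -> tuple ("f", t_1, ..., t_k)
#   constant c        -> 1-tuple ("c",)

def match(pattern, target, subst=None):
    """Return a dict sigma with sigma(pattern) == target, or None if none exists."""
    sigma = dict(subst or {})
    stack = [(pattern, target)]
    while stack:
        p, t = stack.pop()
        if isinstance(p, str):                 # p is a variable
            if sigma.setdefault(p, t) != t:    # already bound to another term
                return None
        elif isinstance(t, str) or p[0] != t[0] or len(p) != len(t):
            return None                        # symbol or arity mismatch
        else:
            stack.extend(zip(p[1:], t[1:]))    # descend into the arguments
    return sigma


def substitute(sigma, term):
    """Apply the substitution sigma to a term."""
    if isinstance(term, str):
        return sigma.get(term, term)
    return (term[0],) + tuple(substitute(sigma, a) for a in term[1:])
```

Matching is one-sided: only variables of the pattern are bound, while variables of the target are treated like constants.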

Saturation and Subsumption. Most first-order theorem 
provers, see e.g. [41], [21], [34], implement saturation with 
redundancy, using the superposition calculus [2]. A clause C 
subsumes a clause D iff there exists a substitution σ such
that σ(C) ⊆ D, where C and D are treated as multisets
of literals. Subsumption is a simplification rule that deletes 
subsumed clauses from the search space during saturation. 
Subsumption gives a powerful basis for other simplification 
rules. For example, subsumption resolution [21], [34], also 
known as contextual literal cutting or self-subsuming resolution, 
is the combination of subsumption with binary resolution; 
and subsumption demodulation [16] results from combining 
subsumption with demodulation/rewriting. 

SAT Solving. Let ℬ be a countably infinite set of boolean
variables. We denote boolean variables by b, possibly with
indices. We use the standard boolean connectives ∧, ∨, →, ¬,
and write ⊤ for the boolean constant true as well as ⊥ for
the boolean constant false. A boolean literal, denoted l, is a
variable b or its negation ¬b. A boolean clause is a disjunction
of boolean literals. As before, we drop the qualifier boolean when
it is clear from the context.

Modern SAT solvers are based on conflict-driven clause 
learning (CDCL) [24], with the core procedures decide, unit- 
propagate, and resolve-conflict. The solver maintains a partial 
assignment of truth values to the boolean variables. Unit 
propagation (also called boolean constraint propagation), that 
is unit-propagate in a SAT solver, propagates clauses w.r.t. the 
partial assignment. If exactly one literal l in a clause remains
unassigned in the current assignment while all other literals
are false, the solver sets l to true to avoid a conflict. The
two-watched-literals scheme [26] is the standard approach for 
efficient implementation of unit propagation. 
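
As a point of reference, unit propagation can be sketched naively in Python as follows. This sketch rescans all clauses until a fixpoint is reached and deliberately omits the two-watched-literals scheme; the DIMACS-style integer literals and the function name are assumptions of this illustration.

```python
def unit_propagate(clauses, assignment):
    """Exhaustively apply unit propagation (naive fixpoint version).

    clauses:    list of clauses, each a list of non-zero ints
                (v denotes a variable, -v its negation).
    assignment: dict variable -> bool (partial assignment), updated in place.
    Returns a falsified clause if a conflict is found, otherwise None.
    """
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned = []
            satisfied = False
            for lit in clause:
                value = assignment.get(abs(lit))
                if value is None:
                    unassigned.append(lit)
                elif value == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return clause                   # all literals false: conflict
            if len(unassigned) == 1:            # unit: the last literal is forced
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0
                changed = True
    return None
```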

If no propagation is possible, the solver may choose a
currently unassigned variable b and set it to true or false;
this is the decide step of SAT solving. The number of variables in
the current assignment that have been assigned by decision is 
called the decision level. 

If all literals in a clause are false in the current assignment, 
the solver enters conflict resolution, via the resolve-conflict 
block of SAT solving. If the current decision level is 0, the 
conflict follows unconditionally from the input clauses and 
the solver returns "unsatisfiable" (UNSAT). Otherwise, by
analyzing how the literals in the conflicting clause have been 
assigned, the solver may derive and learn a conflict lemma, 
undo some decisions, and continue solving. 


III. SUBSTITUTION CONSTRAINTS AND SUBSUMPTION 


Recall that a first-order clause C subsumes a clause D iff
there exists a substitution σ such that σ(C) ⊆ D, where ⊆ is to
be understood as multiset inclusion. In what follows, we refer
by clausal subsumption between C and D to the case when
clause C subsumes clause D. Similarly, literal subsumption
between L and M refers to the case when literal L subsumes
literal M. We note that deciding literal subsumption, that is
whether a literal L subsumes a literal M, can be done in
almost linear time, by constructing a substitution σ (if it exists)
s.t. σ(L) = M; in this case, the value of σ(x) is uniquely
determined by L and M for each variable x occurring in L.
However, when working with arbitrary, and not necessarily
unit, clauses C, D, deciding clausal subsumption between C, D
is NP-complete for the following reason: for each literal L_i
of C, one of the literals M_{j_i} of D needs to be chosen in such
a way that a substitution σ simultaneously matches each L_i
with its respective M_{j_i}; that is, σ(L_i) = M_{j_i} for all i. Towards
addressing NP-completeness of clausal subsumption, in this
section we introduce substitution constraints (Section III-A),
allowing us to formulate clausal subsumption as a SAT
problem over substitution constraints (Section III-B). Based
on this SAT-encoding of subsumption, we further present an
effective approach towards using subsumption in saturation in
Section IV.


A. Substitution Constraints 


We first introduce substitution constraints to be further used 
in deciding clausal subsumption. 

Definition 1 (Substitution Constraints): A substitution
constraint Γ is a partial function from 𝒱 to 𝒯, denoted as

$$(x_1, \dots, x_k) \mapsto (t_1, \dots, t_k),$$

where k ≥ 0, x_i ∈ 𝒱 are pairwise different, and t_i ∈ 𝒯. The
set dom(Γ) := {x_1, ..., x_k} is called the domain of Γ. We
further write Γ(x_i) = t_i for i ∈ {1, ..., k}.

A substitution σ: 𝒱 → 𝒯 satisfies the substitution
constraint Γ, written σ ⊨ Γ, iff σ(x_i) = t_i for all i ∈ {1, ..., k}.
Two substitution constraints Γ_1, Γ_2 are compatible if there
exists a substitution σ that satisfies both Γ_1 and Γ_2, that is, if
Γ_1(x) = Γ_2(x) for all variables x ∈ dom(Γ_1) ∩ dom(Γ_2).

As already discussed, literal subsumption between two 
literals L and M can easily be determined (as there is only 
one literal, i.e. L, that needs to be matched, i.e. to M). The 
substitution constraint corresponding to the literal subsumption 
between L and M is denoted by Γ(L, M) and is defined below.

Definition 2 (Substitution Constraints for Literals): Let L
and M be two literals. If there exists a substitution σ such that
σ(L) = M, the substitution constraint Γ(L, M) for literals L
and M is

$$\Gamma(L, M) := (x_1, \dots, x_k) \mapsto (t_1, \dots, t_k),$$

where 𝒱(L) = {x_1, ..., x_k} and σ(x_i) = t_i for all i ∈
{1, ..., k}. Otherwise, L cannot be matched to M and the
substitution constraint Γ(L, M) for literals L and M is

$$\Gamma(L, M) := \bot.$$

Example 1: Consider the following first-order literals:

$$\begin{aligned}
L_1 &= p(x_1, x_2, x_3) & L_2 &= p(f(x_2), x_4, x_4) \\
M_1 &= p(f(c), d, y_1) & M_2 &= p(f(d), c, c)
\end{aligned}$$

We obtain the following substitution constraints:

$$\begin{aligned}
\Gamma(L_1, M_1) &= (x_1, x_2, x_3) \mapsto (f(c), d, y_1) \\
\Gamma(L_1, M_2) &= (x_1, x_2, x_3) \mapsto (f(d), c, c) \\
\Gamma(L_2, M_1) &= \bot \\
\Gamma(L_2, M_2) &= (x_2, x_4) \mapsto (d, c)
\end{aligned}$$

The constraints Γ(L_1, M_1) and Γ(L_1, M_2) are incompatible, as
these constraints map, for example, x_1 to different values. The
constraints Γ(L_1, M_1) and Γ(L_2, M_2) are compatible, as both
constraints require their only shared variable x_2 to be mapped
to d.
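
The constraints of Example 1 can be recomputed with the following Python sketch. The representation is an assumption of this illustration: a literal is a pair (polarity, atom), atoms and terms are tuples, variables are strings, Γ(L, M) is a dict, and ⊥ is represented by None. Treating ⊥ as incompatible with every constraint is a convention of the sketch.

```python
def match_into(pattern, target, sigma):
    """Extend sigma in place so that sigma(pattern) == target; False on failure."""
    if isinstance(pattern, str):                       # variable
        return sigma.setdefault(pattern, target) == target
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return False                                   # symbol or arity mismatch
    return all(match_into(p, t, sigma) for p, t in zip(pattern[1:], target[1:]))


def gamma(L, M):
    """Substitution constraint Γ(L, M) as a dict, or None for ⊥."""
    (pol_L, atom_L), (pol_M, atom_M) = L, M
    sigma = {}
    if pol_L != pol_M or not match_into(atom_L, atom_M, sigma):
        return None
    return sigma


def compatible(g1, g2):
    """Two constraints are compatible iff they agree on all shared variables."""
    if g1 is None or g2 is None:
        return False
    return all(g2[x] == t for x, t in g1.items() if x in g2)


# The literals of Example 1.
c, d = ("c",), ("d",)
L1 = (True, ("p", "x1", "x2", "x3"))
L2 = (True, ("p", ("f", "x2"), "x4", "x4"))
M1 = (True, ("p", ("f", c), d, "y1"))
M2 = (True, ("p", ("f", d), c, c))
```

With these definitions, gamma(L2, M1) yields None because x4 would have to be mapped to both d and y1.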

To encode clausal subsumption, we need to combine sub- 
stitution constraints using boolean connectives, and boolean 
variables. For this reason, we now define the semantics of 
boolean combinations of substitution constraints. 

Definition 3 (Boolean Combination of Substitution
Constraints): Let F be a formula using standard boolean
connectives, whose atoms are boolean variables and substitution
constraints. An interpretation I = (α, σ) for such a formula is
a pair of a standard boolean assignment α: ℬ → {⊤, ⊥} and
a substitution σ: 𝒱 → 𝒯.

For a boolean variable b, we define I ⊨ b iff α(b) = ⊤.
For a substitution constraint Γ, we define I ⊨ Γ iff σ ⊨ Γ.
For formulas F with a top-level connective of ∧, ∨, →, or ¬,
we define I ⊨ F inductively in the standard way. For boolean
constants, I ⊨ ⊤ and I ⊭ ⊥.

Remark 1: The formula F can also be translated into an
SMT formula using the theory of equality and uninterpreted
functions (EUF), where substitution constraints are replaced by
conjunctions of equality literals. Let T denote the set of terms t
appearing on the right-hand side of some substitution constraint
in F. We then introduce fresh constant symbols {c_t | t ∈ T},
and replace each substitution constraint Γ = (x_1, ..., x_k) ↦
(t_1, ..., t_k) in F by x_1 = c_{t_1} ∧ ... ∧ x_k = c_{t_k}. To obtain correct
semantics of substitution compatibility, we also need to add

$$\bigwedge_{t, u \in T,\; t \neq u} c_t \neq c_u, \qquad (2)$$

asserting that constants representing different terms in F cannot
be equal.

However, for clausal subsumption in a first-order theorem 
prover, it is vital that the process of encoding subsumption in 
SAT, as well as the setting up of our SAT solver for handling 
this encoding are as lean as possible (see Section V). Hence, 
we did not employ a standard SMT solver with the EUF-based 
encoding discussed above, but instead opted to directly add 
support for substitution constraints to our SAT solver. The 
advantage of our SAT-based approach is that we use fewer
boolean literals, and we avoid using all-different constraints
for terms, such as (2). 


B. SAT-Encoding of Clausal Subsumption 


We now present our formalization to express clausal
subsumption between clauses C and D as a SAT problem over
substitution constraints. To this end, assume that clause C
is L_1 ∨ L_2 ∨ ... ∨ L_n, whereas D is M_1 ∨ M_2 ∨ ... ∨ M_m.
Recall that deciding whether C subsumes D reduces to the
problem of deciding whether there exists a substitution σ such
that σ(C) ⊆ D, where "⊆" denotes multiset inclusion (over
multisets of literals).

For arbitrary literals L_i and M_j, deciding the existence of a
substitution σ with σ(L_i) = M_j can easily be done. Yet, for
clausal subsumption we are left with the challenge of finding
a substitution σ such that, for each L_i, we have one of the M_j
such that σ(L_i) = M_j. To address this challenge, we introduce
new boolean variables b_{ij} to encode possible matchings of L_i
to M_j, given by σ(L_i) = M_j. Additionally, we use Definition 2
to derive the substitution constraints Γ(L_i, M_j). Based on the
boolean variables b_{ij} and substitution constraints Γ(L_i, M_j),
we formalize clausal subsumption between C and D by
ensuring its three properties: (i) each literal L_i in C is matched
to a literal M_j in D, (ii) the same substitution σ is used for
each of these matchings, and (iii) σ(C) ⊆ D is multiset inclusion.
Our formalization of clausal subsumption between C and D is
given as follows.


(i) We first define the following clauses, capturing that each
literal L_i from C must be matched to (at least one) literal
M_j of D:

$$\bigwedge_{1 \le i \le n} \left( b_{i1} \lor b_{i2} \lor \dots \lor b_{im} \right) \qquad (3)$$

(ii) We connect the boolean variables b_{ij} to the substitution
constraints Γ(L_i, M_j) through the following clauses:

$$\bigwedge_{1 \le i \le n} \; \bigwedge_{1 \le j \le m} \left( b_{ij} \rightarrow \Gamma(L_i, M_j) \right). \qquad (4)$$

These clauses employ the substitution constraints
Γ(L_i, M_j) to ensure the same substitution σ is used for
matching L_i and M_j simultaneously, for all i, j.

(iii) As clausal subsumption uses multiset inclusion over the
respective multisets of literals of C and D, we encode the
requirement that each literal of D may only be matched
at most once:

$$\bigwedge_{1 \le j \le m} \mathrm{AtMostOne}(b_{1j}, \dots, b_{nj}), \qquad (5)$$

where AtMostOne(b_{1j}, ..., b_{nj}) is true iff zero or one of
b_{1j}, ..., b_{nj} are true.

Together, the constraints (3), (4), (5) fully capture clausal
subsumption, yielding the following result.

Theorem 1 (Clausal Subsumption as SAT): Clausal
subsumption between clauses C and D is given by the conjunction of
(3), (4), and (5). That is, C subsumes D iff (3) ∧ (4) ∧ (5) is
satisfiable.

Note that for deciding clausal subsumption between C and
D, we only need to establish satisfiability of (3) ∧ (4) ∧ (5) in
Theorem 1: one substitution σ such that σ(C) ⊆ D is sufficient
for deciding that C subsumes D, implying that D can be deleted
from the set of usable clauses during saturation. Hence, while
the encoding (3) ∧ (4) ∧ (5) captures all substitutions σ
for which σ(C) ⊆ D, for deciding whether C subsumes D we are
interested in finding only one satisfying instance of (3) ∧ (4) ∧ (5).
As a result, application of clausal subsumption in saturation
can be decided by solving the satisfiability of (3) ∧ (4) ∧ (5).

Example 2: Consider the literals defined in Example 1 and
clauses C = L_1 ∨ L_2 and D = M_1 ∨ M_2. The encoding of clausal
subsumption between C and D resulting from Theorem 1 is
the conjunction of the following clauses:

$$\begin{aligned}
& b_{11} \lor b_{12} \\
& b_{21} \lor b_{22} \\
& b_{11} \rightarrow (x_1, x_2, x_3) \mapsto (f(c), d, y_1) \\
& b_{12} \rightarrow (x_1, x_2, x_3) \mapsto (f(d), c, c) \\
& b_{21} \rightarrow \bot \\
& b_{22} \rightarrow (x_2, x_4) \mapsto (d, c) \\
& \neg b_{11} \lor \neg b_{21} \\
& \neg b_{12} \lor \neg b_{22}
\end{aligned}$$

This set of clauses is satisfiable, as witnessed by the model that
assigns b_{11} and b_{22} to true, b_{12} and b_{21} to false, and σ(x_1) = f(c),
σ(x_2) = d, σ(x_3) = y_1, σ(x_4) = c. We conclude that the
first-order clause C subsumes D.
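
To make the encoding concrete, the following Python sketch checks (3) ∧ (4) ∧ (5) by brute-force enumeration of all assignments to the variables b_{ij}. It is exponential and meant purely as an executable reading of Theorem 1, not as the CDCL-based procedure of this paper; the data representation (literals as (polarity, atom) pairs, ⊥ as None) and all function names are assumptions of the sketch.

```python
from itertools import product


def match_into(pattern, target, sigma):
    """Extend sigma in place so that sigma(pattern) == target; False on failure."""
    if isinstance(pattern, str):                       # variable
        return sigma.setdefault(pattern, target) == target
    if isinstance(target, str) or pattern[0] != target[0] or len(pattern) != len(target):
        return False
    return all(match_into(p, t, sigma) for p, t in zip(pattern[1:], target[1:]))


def gamma(L, M):
    """Γ(L, M) as a dict of bindings, or None for ⊥."""
    sigma = {}
    if L[0] != M[0] or not match_into(L[1], M[1], sigma):
        return None
    return sigma


def merge(constraints):
    """One substitution satisfying all given constraints, or None."""
    sigma = {}
    for g in constraints:
        if g is None:
            return None
        for x, t in g.items():
            if sigma.setdefault(x, t) != t:
                return None
    return sigma


def subsumes(C, D):
    """Return (b, sigma) for the first model of (3) ∧ (4) ∧ (5), or None."""
    n, m = len(C), len(D)
    G = [[gamma(L, M) for M in D] for L in C]          # Γ(L_i, M_j)
    for bits in product([False, True], repeat=n * m):
        b = [bits[i * m:(i + 1) * m] for i in range(n)]
        if not all(any(row) for row in b):             # (3) at least one per L_i
            continue
        if any(sum(b[i][j] for i in range(n)) > 1 for j in range(m)):
            continue                                   # (5) at most one per M_j
        chosen = [G[i][j] for i in range(n) for j in range(m) if b[i][j]]
        sigma = merge(chosen)                          # (4) one common substitution
        if sigma is not None:
            return b, sigma
    return None


# Example 2: C = L1 ∨ L2 and D = M1 ∨ M2.
c, d = ("c",), ("d",)
L1 = (True, ("p", "x1", "x2", "x3"))
L2 = (True, ("p", ("f", "x2"), "x4", "x4"))
M1 = (True, ("p", ("f", c), d, "y1"))
M2 = (True, ("p", ("f", d), c, c))
result = subsumes([L1, L2], [M1, M2])
```

For Example 2, result contains the model with b_{11} and b_{22} true together with the witnessing substitution. Since (5) enforces multiset inclusion, the clause p(x) ∨ p(y) does not subsume the unit clause p(c) under this encoding.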

Remark 2 (Subsumption Resolution): Our encoding of clausal
subsumption can be adjusted to also decide the application
of other simplification rules in saturation, when these rules
implement variants of subsumption. To this end, we have
extended the SAT encoding (3) ∧ (4) ∧ (5) of clausal subsumption
to the inference rule subsumption resolution. In addition to
clausal subsumption, subsumption resolution also uses instances
of binary resolution. Hence, for finding substitutions σ such
that subsumption resolution between clauses C and D can be
applied (and D deleted from the set of usable clauses), we
extended the clauses (3) ∧ (4) ∧ (5) with additional constraints
capturing application of resolution, while also adjusting the
encoding of (3) ∧ (4) ∧ (5) to set inclusion between literals of
C and D (instead of multiset inclusion from subsumption).


Remark 3 (At-Most-One Constraints): We conclude this
section by noting that a correct but naive solution to encode
AtMostOne(b_{1j}, ..., b_{nj}) in (5) would be the following:

$$\bigwedge_{1 \le i_1 < i_2 \le n} \left( \neg b_{i_1 j} \lor \neg b_{i_2 j} \right). \qquad (6)$$

More efficient encodings of at-most-one constraints (see,
e.g., [13]) can be used instead of (6). In our work however, we
opted to add direct support for at-most-one constraints when
reasoning about (5) (see Section IV).
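
The pairwise encoding (6) can be generated as in the following Python sketch, which also shows its quadratic size. The DIMACS-style integer variables and the numbering function var are assumptions of this illustration.

```python
from itertools import combinations


def at_most_one_pairwise(variables):
    """Pairwise encoding (6): one binary clause (¬a ∨ ¬b) per unordered pair.

    Variables are positive ints; a negative int denotes a negated variable.
    """
    return [(-a, -b) for a, b in combinations(variables, 2)]


def column_constraints(n, m, var):
    """Clauses encoding (5) for every column j, where var(i, j) numbers b_ij."""
    clauses = []
    for j in range(1, m + 1):
        column = [var(i, j) for i in range(1, n + 1)]
        clauses.extend(at_most_one_pairwise(column))
    return clauses
```

For n literals in C, each of the m columns contributes n(n-1)/2 binary clauses, which is the overhead avoided by handling at-most-one constraints natively.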


IV. EFFECTIVE SUBSUMPTION VIA LEAN SAT SOLVING 


In Section III we showed that the application of subsumption,
as an inference rule in saturation, can be reduced to the
satisfiability problem of the formula (3) ∧ (4) ∧ (5) using
substitution constraints (Theorem 1). In this section we describe
our approach for solving (3) ∧ (4) ∧ (5).

A straightforward approach towards handling (3) ∧ (4) ∧ (5)
would be to translate it into purely propositional
clauses; yet, such a translation would either require additional
propositional variables to encode at-most-one constraints
or would come with a quadratic number of propositional
clauses [13]; similarly for substitution constraints.

Due to the particular distribution of subsumption instances
(see Section V), the encoding must be lightweight to be
practically feasible. To avoid the increase in propositional
variables/clauses when deciding clausal subsumption,
we support substitution constraints (4) and
at-most-one constraints (5) directly in SAT solving, and
introduce a lean SAT solving approach tailored to subsumption
properties. In particular, we adjust unit propagation and
conflict resolution in CDCL-based SAT solving for handling
propositional formulas with substitution constraints. This way,
we integrate our lean SAT solving methodology directly into the
saturation process of first-order proving (Section V), instead of
interfacing first-order proving with an existing off-the-shelf SAT
solver. Such a direct integration allows us to efficiently identify
and apply subsumption during proof search (see Section VI).


a) Using Substitution Constraints in SAT Solving: For
handling substitution constraints in clausal subsumption, we
attach a substitution constraint Γ(L_i, M_j) to each freshly
introduced boolean variable b_{ij} in (3), which is equivalent
to adding the constraint b_{ij} → Γ(L_i, M_j) of (4).


b) Unit Propagation with Substitution Constraints:
Consider now the clauses b_{ij} → Γ(L_i, M_j) using substitution
constraints, with i ∈ {1, ..., n} and j ∈ {1, ..., m}, from
clausal subsumption (3) ∧ (4) ∧ (5). Semantically, these
constraints are equivalent to the following set of binary clauses:

$$\begin{aligned}
\{\, \neg b_{ij} \lor \neg b_{i'j'} \mid{} & i, i' \in \{1, \dots, n\},\; j, j' \in \{1, \dots, m\}, \\
& (i, j) \neq (i', j'), \\
& \exists x \in \mathrm{dom}(\Gamma(L_i, M_j)) \cap \mathrm{dom}(\Gamma(L_{i'}, M_{j'})) \\
& \text{s.t. } \Gamma(L_i, M_j)(x) \neq \Gamma(L_{i'}, M_{j'})(x) \,\},
\end{aligned} \qquad (7)$$

which intuitively encodes that no two incompatible substitution
constraints may be true at the same time.

In our work, instead of creating the binary clauses of (7)
explicitly, we introduce support for substitution constraints
as an additional (unit) propagator in SAT solving: whenever
a boolean variable b_{ij} is assigned to true, our SAT
solver processes the associated bindings for the first-order
variables from dom(Γ(L_i, M_j)), and propagates to false all boolean
variables b_{i'j'} that are associated with conflicting
bindings for variables in dom(Γ(L_i, M_j)) ∩ dom(Γ(L_{i'}, M_{j'})); in
other words, all b_{i'j'} whose associated substitution constraints
are incompatible with Γ(L_i, M_j). This propagation is done
exhaustively once b_{ij} is assigned to true and before standard
unit propagation in SAT solving would be applied. Thus we
ensure that no conflict can occur at this point: if there were
a conflict, that would mean a b_{i'j'} with conflicting bindings
has already been assigned to true; in this case however, we
would have already propagated b_{ij} to false when assigning
b_{i'j'}. An exception in handling conflicts occurs with the initial
propagation before starting the CDCL loop of SAT solving;
in this case, we may get a conflict if two unit clauses with
conflicting substitution constraints have been added. However, in
that case the SAT solver is at decision level 0 and can terminate,
reporting unsatisfiability (UNSAT) of (3) ∧ (4) ∧ (5).
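
The propagation step described above can be sketched in Python as follows. The sketch scans all boolean variables linearly, which a tuned solver would presumably avoid by indexing bindings per first-order variable; the data layout and all names are assumptions of this illustration, and boolean variables whose constraint is ⊥ are simply left out.

```python
def clash(g1, g2):
    """True iff two binding dicts disagree on some shared first-order variable."""
    return any(x in g2 and g2[x] != t for x, t in g1.items())


def propagate_bindings(b, bindings, assignment, reason):
    """To be called right after boolean variable b has been set to True.

    bindings:   dict boolean var -> dict of first-order bindings (its Γ)
    assignment: dict boolean var -> bool, updated in place
    reason:     dict boolean var -> boolean var whose assignment forced it
    Returns the list of boolean variables newly set to False.
    """
    forced = []
    for other, g in bindings.items():
        if other == b or not clash(bindings[b], g):
            continue
        value = assignment.get(other)
        # Invariant from the text: a clashing variable cannot already be True,
        # because assigning it would have propagated b to False earlier.
        assert value is not True, "unexpected conflict"
        if value is None:
            assignment[other] = False
            reason[other] = b      # stands for the implicit clause (¬b ∨ ¬other)
            forced.append(other)
    return forced


# Bindings taken from Example 1 (b21 is omitted since its constraint is ⊥).
c, d = ("c",), ("d",)
bindings = {
    "b11": {"x1": ("f", c), "x2": d, "x3": "y1"},
    "b12": {"x1": ("f", d), "x2": c, "x3": c},
    "b22": {"x2": d, "x4": c},
}
assignment, reason = {"b11": True}, {}
forced = propagate_bindings("b11", bindings, assignment, reason)
```

Storing only b as the reason mirrors the observation that the binary clause of (7) never has to be materialized.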

c) Conflict Resolution with Substitution Constraints:
During conflict resolution in our SAT engine, we proceed as
if the binary clauses (7) were part of the clause database,
i.e., as if the binary clause ¬b_{ij} ∨ ¬b_{i'j'} were the reason for
propagating b_{i'j'} to false. Therefore we only need to store the literal
b_{ij} as the reason for unit propagation. Substitution constraints
during conflict resolution thus do not need specialized treatment
in our SAT solving approach.

d) At-Most-One Constraints: During unit propagation and 
conflict resolution, our at-most-one constraints (5) are treated 
as if we had the corresponding binary clauses from (6), saving 
the overhead from creating additional clauses and variables. 

Remark 4: While we presented our approach in the context
of solving (3) ∧ (4) ∧ (5), our SAT solving approach naturally
supports arbitrary boolean clauses and at-most-one constraints,
as well as substitution constraints in the form b → Γ (where
b is a boolean variable and Γ a substitution constraint).


V. SAT-BASED SUBSUMPTION IN FIRST-ORDER THEOREM 
PROVING 


We implemented our lean SAT-based approach of Section IV 
as a new extension to the theorem prover VAMPIRE. While 
VAMPIRE already implements highly optimized algorithms 
for checking subsumption, these algorithms are built on a
standard, backtracking-based search procedure: using a static
variable ordering and limited amount of unit propagation, 
without learning from conflicts. Hence, the full power of 
SAT-based reasoning with unit propagation and conflict reso- 
lution is not yet supported for subsumption. We overcome 
this limitation by integrating our SAT-based approach for 
clausal subsumption directly in VAMPIRE. Our implementation 
consists of about 5000 lines of C++ code and is available at 
https://github.com/JakobR/vampire/tree/sat-subsumption. 

a) Implementing Subsumption: When establishing
satisfiability of (3) ∧ (4) ∧ (5), we can observe two different types
of subsumption instances:

(i) easy subsumption instances, where not much SAT-based
search is required (very few or even no decisions/conflicts).
For such instances the overhead of setting up the clausal
encoding of (3) ∧ (4) ∧ (5) largely determines the total
running time of our SAT solver.
(ii) hard subsumption instances, whose application is
determined by a significant number of unit propagation and/or
conflict resolution steps in SAT solving.

We recall that the overall goal of our work is to improve
subsumption checking in first-order theorem proving. For this, 
we complemented VAMPIRE with a SAT-based approach to 
decide application of subsumption. Note that the majority of 
the subsumption instances encountered during a typical first- 
order proving attempt are of type (i), with instances of type (ii) 
appearing occasionally, depending on the input formula. Still, 
the total running time is often dominated by type (ii) instances, 
and these are the target of our SAT-based approach. We must 
however be careful to not become slower on type (i) instances, 
thus motivating our choice of a lean, dedicated SAT-solver 
embedded into VAMPIRE. 

In many of the trivial instances of (3) ∧ (4) ∧ (5),
unsatisfiability (UNSAT) can be discovered
already during the encoding of (3) ∧ (4) ∧ (5) (whenever an
empty clause would be added). To save time on these instances,
in our implementation we defer the construction of watch lists
and other data structures until entering the solving loop of our
SAT engine (if at all).

We note that the number of subsumption instances, especially 
easy ones of type (i), during first-order proving can become
quite large, often in the order of millions of instances in a 60s 
run of a theorem prover. Allocating and deallocating a new 
SAT solver instance for each SAT-based subsumption query 
can thus become expensive (see Section VI); therefore, in 
our implementation we keep the same solver instance around, 
and re-use it for different queries. In particular, we keep the 
memory for data structures (such as clause storage, watch lists, 
trail, and others), instead of reallocating it for each query. 

b) Unit Propagation: To achieve efficient unit
propagation, our SAT solver for clausal subsumption watches
two literals of each clause [26]. However, for at-most-one
constraints the situation is different. Consider the constraint
AtMostOne(l_1, ..., l_k) for some k ≥ 3 (note that for k ≤ 2 we
either drop the constraint or add a binary clause instead). As
soon as any l_i is assigned true, all l_j with j ≠ i must be false to
avoid violating the constraint, and are propagated accordingly. Hence,
the solver watches all literals of at-most-one constraints.


VI. EXPERIMENTS 


We evaluated our new SAT-based implementation for clausal
subsumption in VAMPIRE (see Section V). In our experiments,
we were interested (i) to measure the performance
improvements we gain through our approach, as well as (ii) to assess
the advantage of re-using our SAT solver objects, and thus
having our SAT solver directly integrated into the first-order proving
process of VAMPIRE.

Benchmarks. The basis for our benchmarks is formed by
the TPTP library [36] (version 7.5.0), which is a standard 
benchmark library in the theorem proving community. The 
TPTP library contains altogether 24,098 problems in various 
languages, out of which 16,312 problems have been included 
in our evaluation of SAT-based subsumption in VAMPIRE. 
The remaining TPTP problems that we did not use for our 
experiments either use features that VAMPIRE currently does 
not support (e.g., higher-order logic with theories), or did not 
involve subsumption checks. 


Experimental Setup. All our experiments were carried out 
on a cluster at TU Wien, where the compute nodes contain 
two AMD Epyc 7502 processors, each of which has 32 CPU 
cores running at 2.5GHz. Each compute node is equipped with 
1008 GiB of physical memory that is split into eight memory 
nodes of 126 GiB each, with eight logical CPUs assigned to 
each node. We used the tool runexec from the benchmarking 
framework BENCHEXEC [5] to assign each benchmark process 
to a different CPU core and its corresponding memory node, 
while aiming to balance the load evenly across memory 
nodes. Further, we used GNU PARALLEL [38] to schedule 32 
benchmark processes in parallel. 


Experimental Results on Measuring Speed Improvements. 
We emphasize that using a SAT-based approach for deciding 
clausal subsumption will, in theory, not prove problems that 
were not provable before. If a problem is provable while using 
saturation with redundancy, and hence with subsumption, then 
it is also provable using saturation without redundancy, and vice 
versa. However, in practice, saturation with redundancy (hence 
with subsumption) will improve the prover’s performance in 
finding a proof. As such, the aim of our work is to speed up the 
application of subsumption in saturation. For this reason, we 
set up our first experiment to measure the cost of subsumption 
checks in isolation. A similar evaluation has previously been 
done for indexing techniques in first-order provers, see [27]. 
In preparation for this experiment, we ran VAMPIRE, using 
the original backtracking-based subsumption implementation, 
with a timeout of 60 seconds on each TPTP problem while 
logging each subsumption (and subsumption resolution) check 
into a file. Each of these files contains a sequence of subsump- 
tion (and subsumption resolution) checks, which we call the 
subsumption log for a problem. This preparatory step yielded 
a large number of benchmarks that are representative of the 
checks appearing during actual proof search. These benchmarks 
occupy 1.75 TiB of disk space in compressed form, and contain 
approximately 114 billion subsumption checks in total. About 
0.5 % of these subsumption checks are satisfiable (561 million), 
while the rest are unsatisfiable. 


Figure 1. Total running time (in seconds) of backtracking-based vs. SAT-based 
subsumption, with detailed information about outliers in Table I. For marks 
below the dashed line, our SAT-based approach was faster. 

In addition to generating these benchmarks, we have profiled 
the portion of time spent by subsumption in VAMPIRE. Over 
the TPTP problems used for our experiments and a time limit 
of 60 seconds, it ranges from 0% (no subsumption checks) 
to more than 99 % (hard subsumption check), with a mean of 
46 % and the median at 53 %. 

Next, we executed the checks listed in each subsumption log 
and measured the total running times, once for the already ex- 
isting subsumption algorithm of VAMPIRE using backtracking, 
and once for our SAT-based subsumption approach in VAMPIRE. 
The subsumption checks are benchmarked in a similar way as 
they would appear during a regular prover run, i.e., with the 
same caching of intermediate results. For increased reliability, 
each measurement was performed five times, and we report 
the arithmetic mean. 

The results of these experiments are given in Figure 1 and 
Table I. Each mark in Figure 1 represents one subsumption 
log from a TPTP problem, and compares the total running 
times of executing all subsumption checks contained in the 
log with the old backtracking-based algorithm vs. the new 
SAT-based algorithm. The dashed line indicates equal runtime, 
hence, our SAT-based approach was faster for marks below the 
line. In Table I, we give the cumulative times needed to set 
up the subsumption checks, to solve them, and the total time. 
Both the backtracking-based and our SAT-based subsumption 
algorithm can naturally be split up into a setup stage and a 
separate solving stage. The setup stage transforms the two 
input clauses into constraints while the solving stage searches 
for a solution to these constraints. Additionally, the table gives 
detailed data for selected outliers (problems not in the bottom- 
left of Figure 1). 


Table I 
RUNNING TIME OF SUBSUMPTION CHECKS 

Subsumption log     Backtracking-based Subsumption   SAT-based Subsumption
for problem         Setup      Solve      Total      Setup      Solve      Total      Δabs       Δrel
GRP134-1.005        42.87 s    2.21 s     45.08 s    13.87 s    2.61 s     16.48 s    28.60 s    2.74 x
GRP396+1            67.05 s    88.65 s    155.70 s   15.90 s    98.01 s    113.91 s   41.79 s    1.37 x
HAL007+1            33.25 s    30.54 s    63.79 s    17.05 s    94.51 s    111.56 s   -47.78 s   0.57 x
HWV056+1            26.72 s    1.01 s     27.73 s    48.73 s    2.37 s     51.10 s    -23.37 s   0.54 x
HWV058-1            17.32 s    1.05 s     18.37 s    37.57 s    0.53 s     38.10 s    -19.73 s   0.48 x
HWV059-             24.21 s    0.95 s     25.16 s    35.79 s    0.68 s     36.48 s    -11.31 s   0.69 x
HWV060+1            16.61 s    0.66 s     17.26 s    35.82 s    0.73 s     36.55 s    -19.28 s   0.47 x
HWV086+1            17.76 s    1.80 s     19.57 s    50.12 s    3.15 s     53.27 s    -33.71 s   0.37 x
LCL662+1.020        43.78 s    1.64 s     45.42 s    14.33 s    0.86 s     15.19 s    30.23 s    2.99 x
MGT038-1            13.15 s    12.88 s    26.04 s    15.35 s    41.33 s    56.67 s    -30.64 s   0.46 x
MGT066+1            3.45 s     63.99 s    67.44 s    1.95 s     30.87 s    32.82 s    34.63 s    2.06 x
NLP023+1            0.08 s     154.05 s   154.13 s   0.04 s     0.10 s     0.14 s     153.99 s   1082.84 x
NLP023-1            0.09 s     157.46 s   157.55 s   0.05 s     0.10 s     0.14 s     157.40 s   1087.59 x
NLP024+1            0.08 s     88.26 s    88.34 s    0.04 s     0.09 s     0.14 s     88.20 s    642.68 x
NLP024-1            0.09 s     111.20 s   111.28 s   0.05 s     0.10 s     0.15 s     111.13 s   748.52 x
PUZ073+1            24.69 s    26.60 s    51.29 s    14.02 s    0.14 s     14.17 s    37.12 s    3.62 x
SYN307-1            2.09 s     53.81 s    55.90 s    1.17 s     26.73 s    27.90 s    28.01 s    2.00 x
TOP003-2            41.71 s    0.43 s     42.13 s    48.92 s    5.13 s     54.05 s    -11.92 s   0.78 x
... (+16,294)       ...        ...        ...        ...        ...        ...        ...        ...
Total               16.31 h    2.39 h     18.70 h    7.21 h     1.23 h     8.44 h     10.27 h    2.22 x
Total (no reuse)    -          -          -          8.08 h     2.05 h     10.12 h    -          -
Total (VMTF)        -          -          -          7.62 h     1.40 h     9.02 h     -          -

As shown in Figure 1 and Table I, our SAT-based algorithm 
for clausal subsumption gives a clear overall improvement: the 
total running time of the subsumption checks in VAMPIRE drops 
from 18.70 h to 8.44 h, i.e., by a factor of about 2.2. 

Note that for some problems, the running time for the 
backtracking-based subsumption is higher than the original 
timeout of 60s that has been used when collecting subsumption 
logs. The cause of this apparent discrepancy is that VAMPIRE 
was working on a hard subsumption instance when hitting the 
timeout, with the subsequent measurements in Table I showing 
the true cost. Problems such as NLP023+1 are getting stuck 
in the backtracking-based subsumption algorithm, while our 
SAT-based approach would allow proof search to continue 
much further within the same time limit. 

We also evaluated the impact of our custom variable selection 
heuristic (see last paragraph of Section VII) compared to the 
variable-move-to-front (VMTF) heuristic of SAT solvers [8], 
as VMTF is conjectured to perform well for SAT problems 
that are unsatisfiable, being part of the “unstable phase” 
described in [7]. Given that almost all subsumption instances 
are unsatisfiable, we were interested to see how our SAT-based 
approach performs compared to a VMTF heuristic. Our results 
in this respect are listed in the last line of Table I. While 
our custom heuristic shows slightly better solving times than 
VMTF, the difference is rather small. 

Experimental Results on the Advantage of Re-Using SAT 
Solver Objects. We also assessed the importance of re-using 
the SAT solver object instead of re-allocating the solver for 
every subsumption query. The result is given in the second- 
to-last line of Table I, confirming the significance of having 
SAT-based subsumption directly integrated in VAMPIRE. 


VII. RELATED WORK 


Subsumption is one of the most important simplification 
rules in first-order theorem proving. While efficient literal- 
and clause-indexing techniques have been proposed [37], [33], 
optimizing the matching step among multisets of literals, and 
hence clauses, has so far not been addressed. In our work, we 
show that SAT solving methods can provide efficient solutions 
in this respect, further improving first-order theorem proving. 

A related approach that integrates multi-literal matching 
into indexing is given in [35], using code trees. Code trees 
organize potentially subsuming clauses into a trie-like data 
structure with the aim of sharing some matching effort for 
similar clauses. However, the underlying matching algorithm 
uses a fixed branching order and does not learn from conflicts, 
and will thus run into the same issues on hard subsumption 
instances as the standard backtracking-based matching. 

The specialized subsumption algorithm DC [18] is based 
on the idea of separating the clause C into variable-disjoint 
components and testing subsumption for each component 
separately. However, the notion of subsumption considered in 
that work is defined using subset inclusion, rather than multiset 
inclusion. For subsumption based on multiset inclusion, the 
subsumption test for one variable-disjoint component is no 
longer independent of the other components. 

An improved version of that algorithm, called IDC [17], 
tests on each recursion level whether each literal of C by itself 
subsumes D under the current partial substitution, which is a 
necessary condition for subsumption. The backtracking-based 
subsumption algorithm of VAMPIRE uses this optimization 
as well, and our SAT-based approach also implements it as 
propagation over substitution constraints. 

SAT- and SMT-based techniques have previously been 
applied to the setting of first-order saturation-based proof 
search, e.g., in form of the AVATAR architecture [39]. These 
techniques are however independent from our work, as they 
apply the SAT- or SMT-solver over an abstraction of the input 
problem, while in our work we use a SAT-solver to speed up 
certain inferences. 

Some solvers, such as the pseudo-boolean solver Mini- 
Card [23] and the ASP solver Clasp [14], support cardinal- 
ity constraints natively, in a similar way to our handling 
of AtMostOne constraints. Our encoding however requires 
only AtMostOne constraints instead of arbitrary cardinality 
constraints, thus simplifying the implementation. 

We finally note that clausal subsumption can also be seen 
as a constraint satisfaction problem (CSP). In this view, the 
boolean variables $b_{ij}$ in our subsumption encoding $(3) \land (4) \land (5)$ 
represent the different choices of a non-boolean CSP variable, 
corresponding to the so-called direct encoding of a CSP 
variable [40]. A well-known heuristic in CSP solving is the 
minimum remaining values heuristic: always assign the CSP 
variable that has the fewest possible choices remaining. We 
adapted this heuristic to our embedded SAT solver and use it 
to solve subsumption instances (see Section V). 
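
The minimum remaining values heuristic mentioned above can be illustrated with a minimal sketch. It is not VAMPIRE's implementation; the function name `pick_mrv_variable`, the dictionary representation of domains, and the literal names `L1`..`L3` and `M1`..`M3` are assumptions made only for this example.

```python
from typing import Dict, Optional, Set

def pick_mrv_variable(domains: Dict[str, Set[str]],
                      assigned: Set[str]) -> Optional[str]:
    """Return the unassigned CSP variable with the fewest remaining choices."""
    candidates = [v for v in domains if v not in assigned]
    if not candidates:
        return None
    # Ties are broken by name so that the choice is deterministic.
    return min(candidates, key=lambda v: (len(domains[v]), v))

# Hypothetical instance: each literal L_i of the subsuming clause can still
# be matched to the listed literals M_j of the other clause.
domains = {"L1": {"M1", "M2", "M3"}, "L2": {"M2"}, "L3": {"M1", "M3"}}
first = pick_mrv_variable(domains, assigned=set())
second = pick_mrv_variable(domains, assigned={"L2"})
```

Here `L2` is selected first because it has a single remaining match, which forces an early decision and tends to expose conflicts sooner.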


VIII. CONCLUSION 


We advocate the use of lean dedicated SAT solving to 
solve clausal subsumption in first-order theorem proving. We 
introduce substitution constraints to encode subsumption as 
a SAT instance. For solving such instances, we adjust unit 
propagation and conflict resolution in SAT solving towards 
a tailored treatment of substitution constraints. Crucially, our 
encoding together with our tailored solver enables efficient 
setup of subsumption instances. Our experimental results 
indicate that SAT-based subsumption significantly improves 
the performance of first-order proving. Extending our work 
towards equality reasoning, and hence addressing subsumption 
demodulation, is an interesting task for future work. For doing 
so, we believe our substitution constraints would need to encode 
matching also on the term level, and thus not only on the literal 
level, in order to find suitable terms to rewrite. 


Abstract—Max#SAT is an important problem with multiple 
applications in security and program synthesis that is proven 
hard to solve. It is defined as follows: given a parameterized 
quantifier-free propositional formula, compute parameters such 
that the number of models of the formula is maximal. As an 
extension, the formula can include an existential prefix. 

We propose a CEGAR-based algorithm and refinements 
thereof, based on either exact or approximate model counting, 
and prove its correctness in both cases. Our experiments show 
that this algorithm has much better effective complexity than the 
state of the art. 


I. INTRODUCTION 


#SAT is the problem of counting the solutions of a 
quantifier-free propositional formula, the counting version of 
the SAT problem. Max#SAT is the problem of optimizing, 
according to some propositional variables, the number of 
solutions according to the others. We generalize this problem 
to allow an existential prefix in the formula. 

This problem has many practical applications in diverse ar- 
eas of computer science such as quantitative program analysis 
and program synthesis [1]. Most approaches for quantitative 
information flow analysis use approximations, with fast yet 
imprecise solutions. Adaptive attacker synthesis [2] would also 
benefit from advances in Max#SAT efficiency, mainly by being 
able to avoid the use of imprecise heuristics. 

Unfortunately, Max#SAT has high complexity [3], [4], and 
practical solving methods remain costly. At the time of writing, 
only one solver is publicly available off-the-shelf [1]. 

Earlier work on the Max#SAT problem proposed two ap- 
proaches. The first is a probabilistic solving method [1], which 
unfortunately degrades to exhaustive search when seeking 
precise answers to the problem. The second approach [5] 
solves the problem exactly, but scales poorly. 

We present in this paper a new approach to Max#SAT, lever- 
aging ideas from CEGAR solvers, and show its effectiveness 
on various benchmarks used in previous publications on the 
subject. We also present improvements of our algorithm based 
on previous work about symmetry breaking in SAT solvers [6]. 

Our contributions are the following: 


• An effective algorithm to compute maximal solutions 
for the projected model counting problem (Sections III 
and IV). This algorithm relies either on an exact projected 
model counter as a subprocedure, or on an approximate 
one, which should be the case most of the time in practice for 
scalability reasons. A complete correctness proof of this 
algorithm is given for both cases. 

• The extension of our algorithm with SAT symmetry break- 
ing techniques (Section V) and heuristics (Section VI), 
to further improve its efficiency. 

• The implementation of this algorithm in the tool 
BAXMC [7], together with a set of experimental results 
(Section VII), showing the accuracy and performance 
of our Max#SAT algorithm on various benchmarks, with 
respect to the only other available tool. 


This work was partially supported by the French ANR project TAVA (ANR- 
01) funded by the French program Investissements d’avenir. 


II. PRELIMINARIES 


We set our problem in standard Boolean logic. Throughout 
the paper, Greek letters ($\phi$, $\psi$, ...) denote Boolean formulas, 
uppercase calligraphic Latin letters ($\mathcal{V}$, $\mathcal{X}$, $\mathcal{Y}$, $\mathcal{Z}$, ...) denote 
sets of variables, simple uppercase Latin letters ($V$, $X$, $Y$, $Z$, 
...) denote variables, and lowercase variants of these letters ($x$, $y$, 
$z$, ...) denote valuations for these sets of variables. 

Let $\mathbb{B} = \{\mathit{true}, \mathit{false}\}$. A literal is a variable or its negation, 
and the set of literals derived from a set of variables $\mathcal{V}$ is 
denoted by $\overline{\mathcal{V}} = \mathcal{V} \cup \neg\mathcal{V}$. Let $\phi(\mathcal{V})$ be a Boolean formula over 
a set of variables $\mathcal{V}$. A valuation $v : \mathcal{V} \to \mathbb{B}$ is a model of $\phi$ 
if $\phi$ evaluates to true over $v$; this is denoted by $v \models \phi$. 

We say that a formula $\phi$ is satisfiable if there exists $v$ such 
that $v \models \phi$. Otherwise, $\phi$ is deemed unsatisfiable. Determining 
whether a formula is satisfiable or unsatisfiable is called the 
satisfiability problem, also known as SAT. 

The restriction of a valuation $v : \mathcal{V} \to \mathbb{B}$ to $\mathcal{E} \subseteq \mathcal{V}$ is 
denoted by $v|_{\mathcal{E}}$. We say that two valuations $v_1$ and $v_2$ agree 
on $\mathcal{E}$, denoted by $v_1 \sim_{\mathcal{E}} v_2$, if their restrictions to $\mathcal{E}$ are equal. 


A. Base definitions 


Definition II.1 (Equivalence class). Given a valuation $v$ and 
a set $\mathcal{E}$, we call equivalence class of $v$ over $\mathcal{E}$ the set of 
valuations that agree with $v$ over $\mathcal{E}$, that is: 

$$[v]_{\mathcal{E}} = \{ v' \mid v' \sim_{\mathcal{E}} v \}$$

We call $v|_{\mathcal{E}}$ partial models, and $v$ complete models. The 
elements of $[v]_{\mathcal{E}}$ are called the extensions of $v|_{\mathcal{E}}$. 


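Definition II.2. Given a propositional formula $\phi(\mathcal{V})$ and $\mathcal{E} \subseteq \mathcal{V}$, 
$\mathcal{M}_{\mathcal{E}}(\phi) = \{ v|_{\mathcal{E}} \mid v \models \phi \}$ denotes the set of models projected 
over $\mathcal{E}$. 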
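
As an illustration of Definition II.2, the following brute-force sketch enumerates the models of a small formula and projects them onto a subset of variables. The helper names `models` and `projected_models`, the representation of formulas as Python predicates over dictionaries, and the example formula are assumptions made for this example; the enumeration is exponential and only meant for tiny inputs.

```python
from itertools import product

def models(formula, variables):
    """Enumerate all valuations (as dicts) over `variables` satisfying `formula`."""
    for bits in product([False, True], repeat=len(variables)):
        v = dict(zip(variables, bits))
        if formula(v):
            yield v

def projected_models(formula, variables, subset):
    """Set of models restricted to `subset`, each as a sorted tuple of (name, value)."""
    return {tuple(sorted((n, v[n]) for n in subset))
            for v in models(formula, variables)}

# Hypothetical formula: (a or b) and (not a or c)
phi = lambda v: (v["a"] or v["b"]) and ((not v["a"]) or v["c"])
all_models = list(models(phi, ["a", "b", "c"]))
proj_a = projected_models(phi, ["a", "b", "c"], ["a"])
proj_ab = projected_models(phi, ["a", "b", "c"], ["a", "b"])
```

Several complete models may collapse onto the same partial model, which is why the projected sets are smaller than the full model set.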


Remark II.1. We omit the set $\mathcal{E}$ when it contains all the 
variables of $\phi$. That is, $\mathcal{M}(\phi) = \mathcal{M}_{\mathcal{V}}(\phi)$. 


Definition II.3. Given a valuation $v$, we define the update of 
the variable $V \in \mathcal{V}$ to $b$ as: 

$$v[V \mapsto b](X) = \begin{cases} v(X) & \text{if } X \neq V \\ b & \text{otherwise} \end{cases}$$


Definition II.4. Given two formulas $\phi(\mathcal{V})$ and $\psi(\mathcal{V})$, we say 
that $\phi$ entails $\psi$ (denoted by $\phi \models \psi$) if $\mathcal{M}(\phi) \subseteq \mathcal{M}(\psi)$. 


B. Domain-specific definitions 

In the remainder of the paper we consider a partition of 
$\mathcal{V}$ into three sets $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, respectively called witness, 
counting and intermediate variables. Given a Boolean formula 
$\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, we define Max#SAT as an optimization problem 
stated as follows: find $x_m \in \mathcal{M}_{\mathcal{X}}(\phi)$ such that the projected 
model count over $\mathcal{Y}$ of the formula (which will be defined 
later) is maximal. 


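Definition II.5 (Induced set). Given a formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ 
and $x \in \mathcal{M}_{\mathcal{X}}(\phi)$, the set of models over $\mathcal{Y}$ induced by $x$ is: 

$$I_\phi(x) = \{ y \in \mathcal{M}_{\mathcal{Y}}(\phi) \mid \exists z.\ (x, y, z) \models \phi \}$$

We extend this definition to partial witnesses as follows: 

$$I_\phi(x|_{\mathcal{E}}) = \bigcup_{x' \in [x]_{\mathcal{E}}} I_\phi(x')$$


Definition II.6 (Model counting). Given a formula 
$\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, the count of a witness $x$ is defined by the 
size of the set it induces: $|\phi(x, \mathcal{Y}, \mathcal{Z})| = |I_\phi(x)|$. 

We extend this definition to partial witnesses as follows: 

$$|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| = |I_\phi(x|_{\mathcal{E}})|$$


Definition II.7 (Max#SAT). Given a formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, we 
can state the Max#SAT problem more formally as finding 
$x_m \in \mathcal{M}_{\mathcal{X}}(\phi)$ such that: 

$$|\phi(x_m, \mathcal{Y}, \mathcal{Z})| = \max_{x \in \mathcal{M}_{\mathcal{X}}(\phi)} |\phi(x, \mathcal{Y}, \mathcal{Z})|$$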
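
To make Definitions II.5 to II.7 concrete, here is a brute-force reference sketch that computes the count of every witness and returns a maximal one. It is exponential in the number of variables and is not the algorithm proposed in the paper; the names `assignments`, `count`, and `max_sharp_sat`, as well as the example formula, are assumptions made for this illustration.

```python
from itertools import product

def assignments(names):
    """All Boolean valuations over `names`, as dicts."""
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))

def count(formula, x, Y, Z):
    """|phi(x, Y, Z)|: number of y for which some z makes (x, y, z) a model."""
    return sum(
        any(formula({**x, **y, **z}) for z in assignments(Z))
        for y in assignments(Y)
    )

def max_sharp_sat(formula, X, Y, Z):
    """Return (x_m, n_m); x_m is None when the formula has no model."""
    best_x, best_n = None, 0
    for x in assignments(X):
        n = count(formula, x, Y, Z)
        if n > best_n:
            best_x, best_n = x, n
    return best_x, best_n

# Hypothetical formula over X = {x1, x2}, Y = {y1, y2}, Z = {z1}:
# (x1 or y1) and (z1 <-> (y1 and y2)) and (x2 -> z1)
phi = lambda v: ((v["x1"] or v["y1"])
                 and (v["z1"] == (v["y1"] and v["y2"]))
                 and ((not v["x2"]) or v["z1"]))
best = max_sharp_sat(phi, ["x1", "x2"], ["y1", "y2"], ["z1"])
```

The intermediate variable `z1` is existentially quantified: it never contributes to the count, it only decides whether a pair (x, y) can be completed into a model.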


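Property II.1. Given a formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, the count of a 
partial witness is an upper bound of the count of its extensions: 

$$\forall x' \in [x]_{\mathcal{E}},\ |\phi(x', \mathcal{Y}, \mathcal{Z})| \le |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})|$$


Proof. This follows directly from Definition II.5 on induced 
sets. 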


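Property II.2 (Monotonicity of model counting). Given a propo- 
sitional formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{X}$, and $x \in \mathcal{M}_{\mathcal{X}}(\phi)$, 
the count of partial solutions is monotonic: 

$$|\phi(x|_{\mathcal{B}}, \mathcal{Y}, \mathcal{Z})| \le |\phi(x|_{\mathcal{A}}, \mathcal{Y}, \mathcal{Z})|$$

Proof. First, following Definition II.1 we have: 

$$[x]_{\mathcal{B}} \subseteq [x]_{\mathcal{A}}$$

Hence, following Definition II.5: 

$$I_\phi(x|_{\mathcal{B}}) \subseteq I_\phi(x|_{\mathcal{A}})$$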


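Property II.3. Given a Boolean formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ such that 
$x \in \mathcal{M}_{\mathcal{X}}(\phi)$, $\mathcal{E} \subseteq \mathcal{X}$ and $X_i \in \mathcal{X}$, we have: 

$$|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le |\phi(x|_{\mathcal{E} - \{X_i\}}, \mathcal{Y}, \mathcal{Z})| \le |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| + |\phi(x|_{\mathcal{E}}[X_i \mapsto \neg x(X_i)], \mathcal{Y}, \mathcal{Z})|$$


Proof. The first inequality is a direct consequence of Prop- 
erty II.2. The last inequality follows from Definition II.5. 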


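Property II.4. For a given $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ and $\phi'(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ such 
that $\phi \models \phi'$, a witness $x$, and $\mathcal{E} \subseteq \mathcal{X}$ we have: 

$$|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le |\phi'(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})|$$

Proof. For any $y \in I_\phi(x|_{\mathcal{E}})$, as $y \models \phi$, and $\phi \models \phi'$, we get 
$y \models \phi'$ and hence $I_\phi(x|_{\mathcal{E}}) \subseteq I_{\phi'}(x|_{\mathcal{E}})$. 


Property II.5. Given $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, $\psi(\mathcal{X})$ and $x \models \psi$, we have: 

$$|\phi(x, \mathcal{Y}, \mathcal{Z})| = |(\phi \land \psi)(x, \mathcal{Y}, \mathcal{Z})|$$

Proof. Since $x \models \psi$ and $\psi$ does not depend on $\mathcal{Y}$ and $\mathcal{Z}$, 
$(x, y, z) \models \phi$ if and only if $(x, y, z) \models \phi \land \psi$. 


III. SOLVING MAX#SAT 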


This section presents the main algorithm we propose to 
solve the Max#SAT problem. 


A. The main algorithm 


Algorithm 1 takes as input a formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ and 
computes a pair $(x_m, n_m)$ such that $x_m$ is a solution to 
Max#SAT for $\phi$ with model count $n_m$. Together with the 
formula, the algorithm takes multiple precision parameters: 

• $(\epsilon_i)$ that are called tolerance parameters [8]; 

• $(\delta_i)$ that are called confidence parameters [8]; 

• $\kappa$ that is called the persistence parameter. 

Further explanations about these parameters will be given 
later. 

Roughly speaking, this algorithm consists in iterating over 
possible witnesses x of ¢. If the model count for x is less 
than the current best solution, it blocks generalizations of x 
such that all extensions of these generalizations are worse than 
the current best solution (Lines 14 and 16), hence removing a 
chunk of the search space at each iteration. Otherwise, it saves 
the candidate, which is then the new maximum, and blocks 
it (Lines 10 and 16), removing only one candidate from the 
search space. 

We use two kinds of oracles in this algorithm. At Line 7 we 
call a SAT solver. Calls to an existing #SAT oracle (Lines 5, 8 
and 17) can be performed using either an exact or an approxi- 
mate model counter. In the latter case the precision parameters 
taken as input of the algorithm are used to configure the oracle, 
and influence the correctness of the returned value (in the 
former case, simply assume that they are all 0). 


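Definition III.1. Given $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$ and $x \in \mathcal{M}_{\mathcal{X}}(\phi)$, we say that 
$\mathcal{E} \subseteq \mathcal{X}$ is $n$-bounding if $|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le n$. 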


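The GENERALIZE function used in Algorithm 1 at Line 14 
is proved to return $n_m$-bounding sets in both the exact (The- 
orem IV.1) and the approximate case (with probability $1 - \delta$, 
Theorem IV.2). The GENERALIZE function is called Algo- 
rithm 2 in this paper and it will be presented in Section IV. 


Algorithm 1 Pseudocode for the BAXMC algorithm 

 1: function BAXMC$_{\epsilon_0,\epsilon_1,\delta_0,\delta_1,\delta_2,\kappa}$($\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$) 
 2:     $\phi_s \leftarrow \phi$ 
 3:     $x_m \leftarrow \top$ 
 4:     $n_m \leftarrow 0$ 
 5:     $N \leftarrow \mathrm{MC}_{\epsilon_0,\delta_0}(\phi(\emptyset, \mathcal{Y}, \mathcal{Z}))$ 
 6:     while $n_m < \frac{N}{1+\kappa}$ do 
 7:         $x \leftarrow \mathcal{M}_{\mathcal{X}}(\phi_s)$                          ▷ Pick a new candidate 
 8:         $c \leftarrow \mathrm{MC}_{\epsilon_1,\delta_1}(\phi_s(x, \mathcal{Y}, \mathcal{Z}))$ 
 9:         if $c > n_m$ then                                  ▷ New maximum 
10:             $x_m \leftarrow x$ 
11:             $n_m \leftarrow c$ 
12:             $\mathcal{E} \leftarrow \mathcal{X}$ 
13:         else                                               ▷ Find generalization 
14:             $\mathcal{E} \leftarrow$ GENERALIZE$_{\delta_2}$($x, \phi_s, n_m$) 
15:         end if 
16:         $\phi_s \leftarrow \phi_s \land \neg(x|_{\mathcal{E}})$                     ▷ Block 
17:         $N \leftarrow \mathrm{MC}_{\epsilon_0,\delta_0}(\phi_s(\emptyset, \mathcal{Y}, \mathcal{Z}))$ 
18:     end while 
19:     return $x_m, n_m$ 
20: end function 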
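
The main loop can be sketched in its exact setting (all precision parameters equal to 0) as follows. This is a minimal illustration under strong assumptions, not the BAXMC implementation: the SAT and #SAT oracles are replaced by brute-force enumeration, the generalization step is a plain linear sweep rather than the full Algorithm 2, and the names `baxmc_exact`, `assignments`, `count`, and `generalize` are hypothetical.

```python
from itertools import product

def assignments(names):
    """All Boolean valuations over `names`, as dicts."""
    for bits in product([False, True], repeat=len(names)):
        yield dict(zip(names, bits))

def baxmc_exact(phi, X, Y, Z):
    blocked = []  # partial witnesses x|_E already removed from the search space

    def phi_s(v):
        # phi_s is phi conjoined with the negation of every blocked cube.
        return phi(v) and not any(
            all(v[k] == b for k, b in cube.items()) for cube in blocked)

    def count(partial):
        # |phi_s(x|_E, Y, Z)|: number of y induced by some extension of `partial`.
        free = [n for n in X if n not in partial]
        return sum(
            any(phi_s({**partial, **xr, **y, **z})
                for xr in assignments(free) for z in assignments(Z))
            for y in assignments(Y))

    def generalize(x, n):
        # Linear sweep: drop a witness variable whenever the set stays n-bounding.
        E = list(X)
        for var in X:
            trial = [e for e in E if e != var]
            if count({k: x[k] for k in trial}) <= n:
                E = trial
        return E

    x_m, n_m, iterations = None, 0, 0
    N = count({})
    while n_m < N:
        # Brute-force stand-in for the SAT call that picks a new candidate.
        x = next(a for a in assignments(X) if count(a) > 0)
        c = count(x)
        if c > n_m:
            x_m, n_m, E = x, c, list(X)   # new maximum, block only x
        else:
            E = generalize(x, n_m)        # block a whole set of witnesses
        blocked.append({k: x[k] for k in E})
        N = count({})
        iterations += 1
    return x_m, n_m, iterations

# Hypothetical formula: (x1 or y1) and (z1 <-> (y1 and y2)) and (x2 -> z1)
phi = lambda v: ((v["x1"] or v["y1"])
                 and (v["z1"] == (v["y1"] and v["y2"]))
                 and ((not v["x2"]) or v["z1"]))
result = baxmc_exact(phi, ["x1", "x2"], ["y1", "y2"], ["z1"])
```

On this example the loop stops after three iterations although the formula has four witnesses, because generalizing the second candidate blocks every witness with `x2` set to true at once.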


B. Termination and correctness with an exact #SAT oracle 


In this subsection, each $i$-indexed variable of the algorithm 
denotes its value at the end of the $i$-th iteration of the main 
loop. In the exact version of the algorithm, all precision 
parameters are assumed to be equal to 0 and all calls to 
$\mathrm{MC}_{0,0}(\phi(x, \mathcal{Y}, \mathcal{Z}))$ return $|\phi(x, \mathcal{Y}, \mathcal{Z})|$. 


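Theorem III.1 (Termination with an exact #SAT oracle). 
Algorithm 1 always terminates. 


Proof. By construction of $(\phi_{s,i})_i$, each iteration blocks at least 
the current candidate, so we have: 

$$|\mathcal{M}_{\mathcal{X}}(\phi_{s,i+1})| < |\mathcal{M}_{\mathcal{X}}(\phi_{s,i})|$$

The sequence $(n_{m,i})_i$ is obviously increasing. From Prop- 
erty II.4, the sequence $(N_i)_i$ is decreasing and hence 
$(N_i - n_{m,i})_i$ is decreasing. 

Putting all this together, $(|\mathcal{M}_{\mathcal{X}}(\phi_{s,i})| + (N_i - n_{m,i}))_i$ is 
strictly decreasing. 

One can easily see that whenever $|\mathcal{M}_{\mathcal{X}}(\phi_{s,i})| = 0$ it follows 
that $N_i = 0$ and $N_i - n_{m,i} \le 0$. Hence in all cases, after some 
iteration $k$, $N_k - n_{m,k} \le 0$ and the termination follows. 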


Remark III.1. The worst case complexity of Algorithm 1 is 
reached when it iterates over all the witnesses of the formula. 


Let $k$ be the number of iterations performed when Algo- 
rithm 1 terminates, then we have $n_{m,k} \ge N_k$. 


Lemma III.1. At every iteration $i$ of Algorithm 1, we have: 

$$\mathcal{M}_{\mathcal{X}}(\phi_{s,i}) = \mathcal{M}_{\mathcal{X}}(\phi) - \bigcup_{j \le i} [x_j]_{\mathcal{E}_j}$$


Proof. This follows by construction of $\phi_{s,i}$. 


Lemma III.2. At every iteration $i$ of Algorithm 1, and assum- 
ing GENERALIZE($x, \phi, n$) returns $n$-bounding generalizations 
of $x$ (as defined in Definition III.1) we have: 

$$\forall x' \in \bigcup_{j \le i} [x_j]_{\mathcal{E}_j},\ |\phi(x', \mathcal{Y}, \mathcal{Z})| \le n_{m,i}$$

Proof. Let $j \le i$, and let $x \in [x_j]_{\mathcal{E}_j}$. Following Definition III.1, 
we have $|\phi_{s,j}(x, \mathcal{Y}, \mathcal{Z})| \le n_{m,j}$. 
Then by construction of $\phi_{s,j}$ and Property II.5 we have 
$|\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_{m,j}$ which, as $(n_{m,i})_i$ is increasing, proves 
the lemma. 


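Theorem III.2 (Correctness with an exact #SAT oracle). 
Algorithm 1 is correct, i.e., the returned tuple $(x_m, n_m)$ 
satisfies the following relation: 

$$n_m = |\phi(x_m, \mathcal{Y}, \mathcal{Z})| = \max_{x \in \mathcal{M}_{\mathcal{X}}(\phi)} |\phi(x, \mathcal{Y}, \mathcal{Z})|$$


Proof. Following Property II.1 and since $n_{m,k} \ge N_k$ we have: 

$$\forall x \in \mathcal{M}_{\mathcal{X}}(\phi_{s,k}),\ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le N_k \le n_{m,k}$$

Then instantiating Lemma III.2 at iteration $k$ we have: 

$$\forall x \in \bigcup_{i \le k} [x_i]_{\mathcal{E}_i},\ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_{m,k}$$

Following Lemma III.1, at iteration $k$ we have $\mathcal{M}_{\mathcal{X}}(\phi) = 
\mathcal{M}_{\mathcal{X}}(\phi_{s,k}) \cup \bigcup_{i \le k} [x_i]_{\mathcal{E}_i}$ and the result follows. 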


C. Correctness with a probabilistic #SAT oracle 


Since the termination can be proven in the same way as in 
the exact case, we only prove the correctness. 

Let us first recall the expected guarantees provided by 
an approximate model counter [8], where the $\epsilon$ parameter 
characterizes the precision of the result and the $\delta$ parameter 
determines its associated confidence. 


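Property III.1 (Correctness of the Model Counting). The 
count $\mathrm{MC}_{\epsilon,\delta}(\phi(x, \mathcal{Y}, \mathcal{Z}))$ returned by an approximate model 
counter satisfies the following: 

$$P\left[ \frac{1}{1+\epsilon} \le \frac{\mathrm{MC}_{\epsilon,\delta}(\phi(x, \mathcal{Y}, \mathcal{Z}))}{|\phi(x, \mathcal{Y}, \mathcal{Z})|} \le 1+\epsilon \right] \ge 1 - \delta$$

These guarantees extend to partial witnesses naturally, i.e., 
queries of the form $\mathrm{MC}_{\epsilon,\delta}(\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z}))$. 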


The next theorem proves the correctness of Algorithm 1 in 
the approximate case and gives the associated tight bounds. 


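Theorem III.3. Let $(x_m, n_m)$ be the result returned by the 
call BAXMC$_{\epsilon_0,\epsilon_1,\delta_0,\delta_1,\delta_2,\kappa}$($\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$), and let 

$$M = \max_{x \in \mathcal{M}_{\mathcal{X}}(\phi)} |\phi(x, \mathcal{Y}, \mathcal{Z})|$$

If $\delta_1 \le \frac{\delta_2}{|\mathcal{X}|+1}$ then: 

$$P\left[ \frac{1}{1+\epsilon_1} \le \frac{n_m}{|\phi(x_m, \mathcal{Y}, \mathcal{Z})|} \le 1+\epsilon_1 \right] \ge 1 - \delta_1$$

and 

$$P\left[ |\phi(x_m, \mathcal{Y}, \mathcal{Z})| \ge \frac{M}{(1+\epsilon_0)(1+\epsilon_1)(1+\kappa)} \right] \ge (1 - \delta_1) \times \min(1 - \delta_2, 1 - \delta_0)$$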

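Proof. Let $\phi_s$ be the final value of the variable $\phi_s$ after the last 
iteration of the while loop. We have the following guarantees 
from the approximate model counter (Property III.1): 

$$P\left[ \frac{1}{1+\epsilon_1} \le \frac{n_m}{|\phi(x_m, \mathcal{Y}, \mathcal{Z})|} \le 1+\epsilon_1 \right] \ge 1 - \delta_1 \qquad (1)$$

$$P\left[ \frac{1}{1+\epsilon_0} \le \frac{N}{|\mathcal{M}_{\mathcal{Y}}(\phi_s)|} \le 1+\epsilon_0 \right] \ge 1 - \delta_0 \qquad (2)$$

From Theorem IV.2 regarding the GENERALIZE function 
(which will be proved in the next section), we also have that 
for any $x \in \mathcal{M}_{\mathcal{X}}(\phi \land \neg\phi_s)$ it holds (assuming that $\delta_1 \le \frac{\delta_2}{|\mathcal{X}|+1}$): 

$$P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_m \right] \ge 1 - \delta_2. \qquad (3)$$

After the last iteration of the while loop we have that $n_m (1+\kappa) \ge N$. 
Using this and Equation (2) and Property II.5 
we get that for any $x \in \mathcal{M}_{\mathcal{X}}(\phi_s)$ it holds 

$$P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_m (1+\kappa)(1+\epsilon_0) \right] \ge P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le N (1+\epsilon_0) \right]$$
$$= P\left[ |\phi_s(x, \mathcal{Y}, \mathcal{Z})| \le N (1+\epsilon_0) \right] \ge P\left[ |\mathcal{M}_{\mathcal{Y}}(\phi_s)| \le N (1+\epsilon_0) \right] \ge 1 - \delta_0.$$

From Equation (3), for any $x \in \mathcal{M}_{\mathcal{X}}(\phi \land \neg\phi_s)$ it holds 

$$P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_m (1+\kappa)(1+\epsilon_0) \right] \ge P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_m \right] \ge 1 - \delta_2.$$

Hence, for any $x \in \mathcal{M}_{\mathcal{X}}(\phi)$ it holds 

$$P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n_m (1+\kappa)(1+\epsilon_0) \right] \ge \min(1 - \delta_2, 1 - \delta_0)$$

and hence 

$$P\left[ \frac{n_m}{1+\epsilon_1} \ge \frac{M}{(1+\kappa)(1+\epsilon_0)(1+\epsilon_1)} \right] \ge \min(1 - \delta_2, 1 - \delta_0). \qquad (4)$$

Combining this with Equation (1), we obtain 

$$P\left[ |\phi(x_m, \mathcal{Y}, \mathcal{Z})| \ge \frac{M}{(1+\kappa)(1+\epsilon_0)(1+\epsilon_1)} \right] \ge (1 - \delta_1) \times \min(1 - \delta_2, 1 - \delta_0).$$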

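The following corollary instantiates Theorem III.3 in order 
to get the standard form (as in Property III.1). 


Corollary III.1. For any $0 < \epsilon, \delta < 1$, if in the call of the 
BAXMC function, we take as parameters $\epsilon_0 = \epsilon_1 = \kappa = 
\sqrt[3]{1+\epsilon} - 1$, $\delta_0 = \delta_2 = \frac{\delta}{2}$ and $\delta_1 = \frac{\delta}{2(|\mathcal{X}|+1)}$, then the result 
$(x_m, n_m)$ satisfies the following inequalities: 

$$P\left[ \frac{1}{1+\epsilon} \le \frac{n_m}{|\phi(x_m, \mathcal{Y}, \mathcal{Z})|} \le 1+\epsilon \right] \ge 1 - \delta$$

$$P\left[ |\phi(x_m, \mathcal{Y}, \mathcal{Z})| \ge \frac{M}{1+\epsilon} \right] \ge 1 - \delta$$

where 

$$M = \max_{x \in \mathcal{M}_{\mathcal{X}}(\phi)} |\phi(x, \mathcal{Y}, \mathcal{Z})|$$

Proof. It is easy to check that $\epsilon_1 = \sqrt[3]{1+\epsilon} - 1 \le \epsilon$, $\delta_1 = 
\frac{\delta}{2(|\mathcal{X}|+1)} \le \frac{\delta}{2} = \delta_0 \le \delta$, $(1+\epsilon_0)^3 = 1+\epsilon$ and $(1 - \delta_1) \times 
(1 - \delta_0) = 1 - \delta_0 - \delta_1 + \delta_0 \times \delta_1 \ge 1 - 2 \times \delta_0 = 1 - \delta$. 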
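
The arithmetic behind this instantiation can be checked numerically. The sketch below is only an illustration: the function name `instantiate` is hypothetical, and the value of $\delta_1$ is passed in by the caller (any value not exceeding $\delta/2$) rather than fixed, since the check itself only needs $\delta_1 \le \delta_0$.

```python
def instantiate(eps, delta, delta1):
    """Split an overall (eps, delta) budget among the BAXMC parameters.

    Returns (eps0, delta0, tolerance, confidence), where eps0 stands for
    eps0 = eps1 = kappa and delta0 stands for delta0 = delta2.
    """
    eps0 = (1.0 + eps) ** (1.0 / 3.0) - 1.0
    delta0 = delta / 2.0
    if not 0.0 < delta1 <= delta0:
        raise ValueError("delta1 must lie in (0, delta/2]")
    # Product of the three tolerance factors (1+eps0)(1+eps1)(1+kappa).
    tolerance = (1.0 + eps0) ** 3
    # Confidence bound (1 - delta1) * min(1 - delta2, 1 - delta0).
    confidence = (1.0 - delta1) * (1.0 - delta0)
    return eps0, delta0, tolerance, confidence

eps0, delta0, tolerance, confidence = instantiate(0.8, 0.2, 0.01)
```

With these values the combined tolerance factor equals $1 + \epsilon$ up to rounding, and the combined confidence stays above $1 - \delta$.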


IV. GENERALIZATION ALGORITHM 


Algorithm 2 generalizes a single model $x$ with insufficiently 
high count to a set of models with insufficiently high count. 
This is much the same as the way a CDCL loop blocks not only 
one assignment, but a whole set of assignments. 

As shown in Property II.2, generalizing a witness is an 
instance of the MSMP problem (Minimal Set subject to a 
Monotone Predicate), which can be solved using generic 
algorithms such as QUICKXPLAIN [9]. Although in theory 
this should lead to a better algorithm, in practice we observed 
larger numbers of calls to the #SAT oracle, an issue already 
identified in other contexts [10]. 

Algorithm 2 is thus a specific solver of the MSMP problem 
in our setting, relying on a linear sweep over the variables that 
are part of the valuation. 

For efficiency reasons, the steps of Algorithm 2 are 
performed in a precise order, for the following reasons: 

1) The first step relies on a consequence of Property II.3, 
which allows relaxing variables with simple calls to a SAT 
solver. 

2) The log-based generalization is a heuristic that takes 
large steps in the generalization process by relaxing 
multiple variables in each loop iteration. 

3) The linear sweep pass generalizes $x$ in such a way 
that the returned set is minimal, i.e., none of the 
further generalizations of the returned value satisfies 
Definition III.1. 

The returned $\mathcal{E}$ is guaranteed only to be a local minimum, 
and it may not be the smallest set such that Definition III.1 
holds, because of the order in which we consider the variables of 
$\mathcal{X}$ in Algorithm 2. 


A. Correctness and complexity with an exact #SAT oracle 


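Property IV.1. If $\phi(x|_{\mathcal{E}}[X_i \mapsto \neg x(X_i)], \mathcal{Y}, \mathcal{Z})$ is UNSAT 
then: 

$$|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| = |\phi(x|_{\mathcal{E} - \{X_i\}}, \mathcal{Y}, \mathcal{Z})|$$


Proof. This follows directly from Property II.3. 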


Let us prove the correctness of Algorithm 2 in the context 
of an exact #SAT oracle. This will finish the correctness proof 
started in Section III-B. 




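Algorithm 2 Pseudocode for the generalization algorithm 

 1: function GENERALIZE$_{\delta}$($x, \phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z}), n_m$) 
 2:     $\mathcal{E} \leftarrow \mathcal{X}$ 
 3:     for all $X_i \in \mathcal{X}$ do                               ▷ Redundancy elimination 
 4:         if $\phi(x[X_i \mapsto \neg x(X_i)], \mathcal{Y}, \mathcal{Z})$ UNSAT then 
 5:             $\mathcal{E} \leftarrow \mathcal{E} - \{X_i\}$ 
 6:         end if 
 7:     end for 
 8:     $k \leftarrow \log n_m - \log \mathrm{MC}_{\epsilon,\delta_1}(\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z}))$ 
 9:     while $k > 0 \land |\mathcal{E}| > 0$ do                        ▷ Log-elimination 
10:         $\mathcal{A}_k \leftarrow \{ \mathcal{V} \subseteq \mathcal{E},\ |\mathcal{V}| = k \}$ 
11:         $c \leftarrow \mathrm{MC}_{\epsilon,\delta_1}(\phi(x|_{\mathcal{E} - \mathcal{A}_k}, \mathcal{Y}, \mathcal{Z}))$ 
12:         if $c \le \frac{n_m}{1+\epsilon}$ then 
13:             $\mathcal{E} \leftarrow \mathcal{E} - \mathcal{A}_k$ 
14:             $k \leftarrow \log n_m - \log c$ 
15:         else 
16:             $k \leftarrow k - 1$ 
17:         end if 
18:     end while 
19:     for all $X_i \in \mathcal{X} - \mathcal{E}$ do                           ▷ Refinement 
20:         if $\mathrm{MC}_{\epsilon,\delta_1}(\phi(x|_{\mathcal{E} - \{X_i\}}, \mathcal{Y}, \mathcal{Z})) \le \frac{n_m}{1+\epsilon}$ then 
21:             $\mathcal{E} \leftarrow \mathcal{E} - \{X_i\}$ 
22:         end if 
23:     end for 
24: end function 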


Theorem IV.1. Algorithm 2 terminates and is correct: the 
returned set $\mathcal{E}$ satisfies Definition III.1, i.e., $|\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le n$. 


Proof. In the while loop at Line 9 we can see that, at 
each iteration, either $|\mathcal{E}|$ or $k$ decreases, thus ensuring the 
termination of the algorithm. 

During any update of the temporary value $\mathcal{E}$ (Lines 5, 
13 and 21), we ensure that the new value of $\mathcal{E}$ satisfies 
Definition III.1: 


1) At Line 5, Property IV.1 keeps the model count 
stable. 

2) At Lines 13 and 21, the update is guarded by the explicit 
check of the property (in the if statements at Lines 12 
and 20). 


Hence the correctness follows. 


B. Bounds with an approximate #SAT oracle 


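Theorem IV.2. Let $\mathcal{E} \subseteq \mathcal{X}$ be the set returned by the call 
GENERALIZE$_{\delta}$($x, \phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z}), n$), and assume that 

$$P\left[ |\phi(x, \mathcal{Y}, \mathcal{Z})| \le n \right] \ge 1 - \frac{\delta}{|\mathcal{X}|+1}$$

Then: 

$$P\left[ |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le n \right] \ge 1 - \delta$$


Proof. Using Property IV.1, the variable $\mathcal{E}$ after the first loop 
within Algorithm 2 satisfies $|\phi(x, \mathcal{Y}, \mathcal{Z})| = |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})|$. 

We denote by $C_{x|_{\mathcal{V}}}$ the value returned by the call 
$\mathrm{MC}_{\epsilon,\delta_1}(\phi(x|_{\mathcal{V}}, \mathcal{Y}, \mathcal{Z}))$. Since each time we update $\mathcal{E}$ to a set 
$\mathcal{V}$ we ensure $C_{x|_{\mathcal{V}}} \le \frac{n}{1+\epsilon}$, we have the following probability: 

$$P\left[ |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le n \right] \ge 1 - \delta_1$$

Let $\mathcal{E}_l$ denote the value obtained after $l$ updates of variable 
$\mathcal{E}$ during the LOG-ELIMINATION and REFINEMENT steps within 
Algorithm 2 and let us denote by $P_l$ the probability that the 
set $\mathcal{E}_l$ is approximately $n$-bounding. 

Using that we update $\mathcal{E}$ to the value $\mathcal{E}_l$ only if $C_{x|_{\mathcal{E}_l}} (1+\epsilon) \le n$, 
we have the following recursive relation: 

$$P_l = P\left[ |\phi(x|_{\mathcal{E}_l}, \mathcal{Y}, \mathcal{Z})| \le n \right] \times P_{l-1} \ge (1 - \delta_1) \times P_{l-1} \ge (1 - \delta_1)^l \times P_0 \ge (1 - \delta_1)^l \times \left(1 - \frac{\delta}{|\mathcal{X}|+1}\right)$$

Thus, as $l \le |\mathcal{X}|$, if we take $\delta_1 = \frac{\delta}{|\mathcal{X}|+1}$ and we call the 
#SAT oracle with parameters $(\epsilon, \frac{\delta}{|\mathcal{X}|+1})$ we get: 

$$P\left[ |\phi(x|_{\mathcal{E}}, \mathcal{Y}, \mathcal{Z})| \le n \right] \ge (1 - \delta_1)^{l+1} \ge 1 - (l+1) \times \delta_1 \ge 1 - \delta$$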
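
The final chain of inequalities is an application of Bernoulli's inequality and can be checked numerically. The sketch below uses a hypothetical helper name, `generalization_confidence`, and assumes $\delta_1 = \delta / (|\mathcal{X}| + 1)$ as in the proof.

```python
def generalization_confidence(delta, num_witness_vars, updates):
    """Return ((1 - d1)^(l+1), 1 - (l+1) * d1) for d1 = delta / (|X| + 1)."""
    delta1 = delta / (num_witness_vars + 1)
    exact = (1.0 - delta1) ** (updates + 1)
    linear = 1.0 - (updates + 1) * delta1
    return exact, linear

# Every admissible number of updates l <= |X| for |X| = 10 and delta = 0.2.
bounds = [generalization_confidence(0.2, 10, l) for l in range(11)]
```

The linear bound reaches $1 - \delta$ exactly when the number of updates equals $|\mathcal{X}|$, which is the tight case described in Remark IV.1.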


Remark IV.1. The bound with respect to the number of updates 
is tight. The worst case is reached when the only valid subset 
of X is X itself, that is when the model cannot be generalized. 


V. BREAKING SYMMETRIES IN MAX#SAT 


Symmetries are a special kind of permutations of the input 
variables of a formula leaving it intact. Exploiting or breaking 
symmetries in SAT formulas has long been a topic of interest. 

For instance, if a formula is left intact by such a permutation 
then for each blocking clause C, the solver may need to 
generate the full orbit of C by the group of permutations, 
leading to combinatorial explosion. Breaking the symmetry 
means selecting one solution per orbit by adding a predicate 
called symmetry breaking predicate to the formula, purpose- 
fully generated to break the symmetries. The resulting formula 
is equisatisfiable, but often simpler to solve. 


A. Correctness in the presence of symmetries 


In our context, handling symmetries within the witness 
set reduces the size of the search space, and leads to better 
complexity. We give in this section arguments about why this 
is true. 


Definition V.1. Given a Boolean formula $\phi(\mathcal{X}, \mathcal{Y}, \mathcal{Z})$, a sym- 
metry of $\phi$ is a bijective function $\sigma : \overline{\mathcal{X}} \to \overline{\mathcal{X}}$ that preserves 
negation, that is $\sigma(\neg X) = \neg\sigma(X)$, and such that, when $\sigma$ is 
lifted to formulas, $\sigma(\phi) = \phi$ syntactically [11]. 

$S_\phi$ denotes the set of all symmetries of $\phi$. We lift $S_\phi$ 
to models by defining the set of symmetries of a model $x$, 
$S_\phi(x) = \{ x \circ \sigma \mid \sigma \in S_\phi \}$. 


Theorem V.1. In Algorithm 1, picking only one $x$ per symmetry class of $\phi$ preserves the correctness of the algorithm both in the exact and approximate case.

Proof. Whatever the method used to select only one member of each symmetry class, this corresponds to creating a symmetry breaking predicate $\psi(X)$ and solving the problem over $\phi \wedge \psi$, and thus Property I.5 applies.


B. Implementing Max#SAT symmetry breaking 


We detect symmetries in $\phi$ using the automorphisms of a colored graph representing the formula, defined as follows:


• For each variable, create two nodes: one for the positive literal, and one for the negative literal. Use color 0 if the variable is in X, otherwise use color 1. Add an edge (Boolean consistency edge) between the two nodes.

• For each clause, create a node, and assign to it the color 2. Add an edge between this clause node and every node corresponding to a literal present in the clause.
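The construction above can be sketched as follows. This is a minimal illustration, assuming a DIMACS-style clause list (signed integers) and a set of X variables; the node naming scheme and the function name are hypothetical, and handing the result to an automorphism tool such as BLISS is not shown.

```python
from typing import Dict, Iterable, List, Set, Tuple

Node = Tuple[str, int]


def build_colored_graph(
    clauses: List[List[int]], x_vars: Iterable[int]
) -> Tuple[Dict[Node, int], Set[Tuple[Node, Node]]]:
    """Return (colors, edges) of the colored graph of a CNF formula."""
    x_set = set(x_vars)
    colors: Dict[Node, int] = {}
    edges: Set[Tuple[Node, Node]] = set()

    variables = {abs(lit) for clause in clauses for lit in clause} | x_set
    for var in sorted(variables):
        color = 0 if var in x_set else 1      # color 0 for X, 1 otherwise
        colors[("lit", var)] = color          # positive literal node
        colors[("lit", -var)] = color         # negative literal node
        edges.add((("lit", var), ("lit", -var)))  # Boolean consistency edge

    for index, clause in enumerate(clauses):
        clause_node = ("clause", index)
        colors[clause_node] = 2               # clause nodes get color 2
        for lit in clause:
            edges.add((clause_node, ("lit", lit)))

    return colors, edges


colors, edges = build_colored_graph([[1, 2], [-1, 3]], x_vars={1, 2})
```

Edges are stored as ordered pairs only for convenience; they are meant to be read as undirected.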


Many tools can be used in order to list the automorphisms 
of a graph. In our case, we used BLISS [12] because of its 
C++ interface, and its performance. 

After detecting the symmetries, one can use any symmetry 
breaking technique available, either static [13] or dynamic 
[6]. In our implementation, we chose to use CDCLSYM [6] 
because of its ease of use, and because it avoids generating 
complex symmetry breaking predicates ahead of time. 


VI. HEURISTICS AND OPTIMIZATIONS 


We present in this section heuristics used in both Algorithms 1 and 2 in practice, and discuss their effectiveness.


A. Progressive construction of the candidate 


A simple yet effective optimization is to gradually add literals to the candidate $x$ in Algorithm 1 at Line 7. Stopping early allows GENERALIZE to be called on a partial assignment instead of a complete one, which decreases the number of calls to the #SAT oracle because it anticipates work that is otherwise done in Algorithm 2.


B. Leads 


When performing the generalization in Algorithm 2, one can see that we can extract hints about promising parts of the search space when relaxing variables. Indeed, when relaxing parts of the solution (Lines 16 and 22), if the model count of the relaxation goes above the current maximum $n_m$, then this part of the search space may contain an improvement over the current solution.

Following this intuition, one can hold a sorted list of relaxations whose count is above the current best known maximum (sorted first by the count of the relaxation, then by the size of the relaxation), and use it to favor parts of the search space that look promising. We call these promising relaxations leads. More formally, given a lead $x|_\ell$, when searching for a new solution in Algorithm 1 at Line 7, instead of searching in $M_X(\phi_s)$, one would search in $M_X(\phi_s) \cap [x|_\ell]$.

Let $L_n(\phi)$ denote the set of leads currently known to the solver with count lower than $n$. When the currently known maximum is improved in Algorithm 1 at Line 10, we can block all leads whose count is below the new maximum:


$$\phi_{s_{i+1}} = \phi_{s_i} \wedge \bigwedge_{[x|_\ell] \in L_{n_m}(\phi)} \neg (x|_\ell)$$
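The bookkeeping described above can be sketched as a small sorted container. This is an illustrative sketch only: the class and method names are hypothetical, a lead is represented as a list of signed integer literals, and the ascending sort direction is an assumption (the text fixes the sort keys but not the direction).

```python
import bisect
from typing import Iterable, List, Optional, Tuple

Lead = Tuple[float, int, Tuple[int, ...]]  # (count, size, literals)


class LeadList:
    """Leads kept sorted by (count, size); hypothetical helper."""

    def __init__(self) -> None:
        self._leads: List[Lead] = []

    def add(self, count: float, literals: Iterable[int]) -> None:
        lits = tuple(sorted(literals))
        bisect.insort(self._leads, (count, len(lits), lits))

    def best(self) -> Optional[Lead]:
        # Assumption: the most promising lead is the one with the highest count.
        return self._leads[-1] if self._leads else None

    def block_below(self, new_max: float) -> List[List[int]]:
        """Drop leads with count < new_max and return their blocking clauses."""
        cut = bisect.bisect_left(self._leads, (new_max,))
        blocked, self._leads = self._leads[:cut], self._leads[cut:]
        # not (l1 and ... and lk) is the clause (not l1 or ... or not lk)
        return [[-lit for lit in lits] for _, _, lits in blocked]


leads = LeadList()
leads.add(10, [1, -2])
leads.add(5, [3])
leads.add(7, [1])
blocking_clauses = leads.block_below(7)   # blocks only the lead with count 5
```

Each returned clause is the negation of one blocked lead, matching the conjunction of negated leads in the formula above.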


C. Decision heuristic 


As discussed in Section IV-A, the performance of the algorithm depends on the order in which variables are considered in various parts of the solving process (in the generalization and during the optimization presented in Section VI-A). This kind of problem, which we call variable scheduling, is also predominant when solving SAT problems, and even #SAT problems.

One first heuristic arises from the leads described in Sec- 
tion VI-B. One can use the leads list as indications for literals 
leading to promising parts of the search space, by finding 
the literal which appears the most in the leads. We call this 
heuristic leads. 

Another decision heuristic can be devised using 
VSIDS [14]. The idea is to assign a weight to each literal 
based on its last appearance in a blocking clause. The weight 
of each literal is increased by a constant amount every time 
the literal appears in a blocking clause, and is multiplicatively 
decreased at each blocking clause. This heuristic showed 
promising results in both SAT and #SAT [15]. We call this 
heuristic vsids. 

One could also choose the next decision variable at random, 
which we call rnd. And finally, one could just pick the 
decision variables in the order they are provided to the tool, 
which we call none. 

An experimental evaluation is given in Section VII-B.
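The leads and vsids heuristics can be sketched as below. This is not the BAXMC implementation: the names, the default bump and decay values, and the choice to decay before bumping are all assumptions made for illustration; literals are signed integers.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Sequence


def most_frequent_literal(leads: Iterable[Sequence[int]]) -> int:
    """`leads` heuristic: the literal that occurs in the most leads."""
    counts = Counter(lit for lead in leads for lit in set(lead))
    return counts.most_common(1)[0][0]


class BlockingClauseScores:
    """`vsids`-style scores driven by blocking clauses (hypothetical helper)."""

    def __init__(self, bump: float = 1.0, decay: float = 0.95) -> None:
        self.bump = bump
        self.decay = decay
        self.score: Dict[int, float] = defaultdict(float)

    def on_blocking_clause(self, clause: Sequence[int]) -> None:
        for lit in list(self.score):       # multiplicative decay of all scores
            self.score[lit] *= self.decay
        for lit in clause:                 # constant bump for literals in the clause
            self.score[lit] += self.bump

    def pick(self, candidates: Sequence[int]) -> int:
        return max(candidates, key=lambda lit: self.score.get(lit, 0.0))


scores = BlockingClauseScores(bump=1.0, decay=0.5)
scores.on_blocking_clause([1, -2])
scores.on_blocking_clause([-2, 3])
next_literal = scores.pick([1, -2, 3])     # -2 has the highest score here
```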


D. Handling equivalent literals 


Equivalent literals are a well-known feature of Boolean formulas which, when exploited, generally results in better runtime performance [16].


Definition VI.1. Given a Boolean formula $\phi$, we say that two literals $L_i \in V$ and $L_j \in V$ are equivalent if $\phi \models L_i \Leftrightarrow L_j$.


Equivalent literals allow formulas to be simplified based on the following theorem.


Theorem VI.1. Let $\phi$ be a Boolean formula and $L_i$ and $L_j$ two equivalent literals. Then solving the Max#SAT problem for $\phi$ reduces to solving the Max#SAT problem for the simpler formula $\phi'$ obtained by replacing all occurrences of $L_j$ (resp. $\neg L_j$) by $L_i$ (resp. $\neg L_i$) when:

1) either $L_i$ and $L_j$ are in the same literal class (either $X$, $Y$ or $Z$)

2) or $L_i \in X$ and $L_j \in Y \cup Z$

3) or $L_i \in Y$ and $L_j \in Z$.

Theorem VI.1 can be applied multiple times in order to further simplify the formula. Literal equivalence can be detected using binary implication graphs [17].


VII. EXPERIMENTAL EVALUATION


Algorithm 1 has been implemented in an open-source tool written in C++ called BAXMC [7], including dynamic symmetry breaking techniques (Section V) and all the heuristics discussed in Section VI. In this implementation, we only incorporated the approximate version of the algorithm using APPROXMC5 [18] as an approximate model counting oracle and CRYPTOMINISAT [19] as a SAT solver oracle. An exact solver is not implemented because we do not, at the time of writing, have another exact Max#SAT solver available as a comparison.

We use three sets of benchmarks, coming either from [20], or from the MaxSat 2021 competition [21]. Benchmarks from this latter class are transformed using the method from [1]. Table I shows more details about the benchmark set considered. Benchmarks annotated with a star indicate that a symmetry was found.

All experiments are run on a Dell R640 with 40 cores and 192 GB of RAM running Debian 11, with a 2-hour timeout, a 10 GB memory limit and with parameters $\delta = 0.2$, $\epsilon = 0.8$.


A. Comparison to MAXCOUNT 


MAXCOUNT [1] is used as an off-the-shelf solver of the problem, with parameters corresponding to $\delta = 0.2$, $\epsilon = 0.8$. Note that these are not the parameters used in the experiments in [1] and that we reimplemented MAXCOUNT using newer oracles. We did this in order to see how MAXCOUNT and BAXMC behave when both are providing the same correctness guarantees and using the same oracles for fairness. All figures from Table II are obtained when BAXMC is used with the (leads, rnd) heuristic combination.

Table II shows the results obtained when running both tools on our three benchmarks. Bolded values are the best values on each line (i.e., smallest time or largest answer). The time columns are the running times of the tools. The model count columns are the values returned by the candidate tools.

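One can see that BAXMC outperforms MAXCOUNT in the timings of every benchmark that at least one of the tools solves, with one exception: on backdoor-32-24 the BAXMC Time column reports 611.12 s against 231.87 s for MAXCOUNT, although the Sym. Time column (34.50 s) is again lower. In cases where BAXMC did not find the best value, it terminates when the bounds on the possible maximum are tight enough. This yields a small error margin on the returned value of BAXMC, but is configurable through its k argument.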


B. Decision heuristic comparison 


Table III shows a comparison between the heuristics that are 
currently available in BAXMC. Lines enumerate the decision 
heuristics from Section VI-C. Columns specify heuristics used 
by the underlying SAT oracle about literals polarities. 

Each cell of this table contains, in sequence: the total running time, the number of times this combination ran the fastest compared to all others, and the number of times this combination timed out. For example, combination (leads, cache) ran for a total time of 62968.38 seconds with 7 timeouts, and ran the fastest on 3 benchmarks over a total number of 26. In this setup, any timeout from BAXMC increases the total running time by 7200 s.


The table shows that none of the heuristics stands out. We can only eliminate random decision as a bad heuristic. Nevertheless, combining heuristics strongly reduces the overall number of timeouts.


VIII. RELATED WORKS 


Previous works on Max#SAT solving may be classified into 
three categories, based respectively on probabilistic solving 
as in MAXCOUNT [1], exhaustive search [5] and knowledge 
compilation [22]. 

Probabilistic solving relies on “amplification” to build a new formula $\phi_k(X, Y, Z) = \bigwedge_{i=1}^{k} \phi(X, Y_i, Z_i)$, where the $Y_i$ and $Z_i$ are fresh copies of the initial $Y$ and $Z$ variables, and uniformly sampling among $M_X(\phi_k)$. The higher the $k$, the more the sampling is attracted towards the $x$ with large projected model counting over $(Y_i)_{i \le k}$. Given parameters $\epsilon$ and $\delta$, the guarantees provided about the returned tuple $(\tilde{n}, \tilde{x})$ are the same as in Corollary II.1 [1]. Unfortunately, when the size of the formula increases, uniform sampling may become quite expensive as shown in our benchmarks. Furthermore, this approach is not incremental: looking for a better solution involves re-running the search from scratch.
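The amplification step can be illustrated on a DIMACS-style clause list. This is a sketch under assumptions, not the MAXCOUNT implementation: the function name and the variable renumbering scheme (copy number times the original variable count) are hypothetical, and the sampling step is not shown.

```python
from typing import Iterable, List


def amplify(
    clauses: List[List[int]], x_vars: Iterable[int], num_vars: int, k: int
) -> List[List[int]]:
    """Conjoin k copies of the formula that share X and use fresh Y, Z copies."""
    shared = set(x_vars)
    amplified: List[List[int]] = []
    for copy in range(k):
        offset = copy * num_vars          # copy 0 reuses the original variables

        def rename(lit: int) -> int:
            var = abs(lit)
            if var in shared:             # X variables are shared by all copies
                return lit
            fresh = var + offset          # fresh copy of a Y or Z variable
            return fresh if lit > 0 else -fresh

        amplified.extend([rename(lit) for lit in clause] for clause in clauses)
    return amplified


# phi = (x1 or y2) and (not y2 or z3), amplified with k = 2
phi_2 = amplify([[1, 2], [-2, 3]], x_vars={1}, num_vars=3, k=2)
```

The renumbering leaves gaps in the variable range, which keeps the sketch short at the cost of unused variable indices.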

On the other side of the spectrum lie exhaustive searches. The idea here is to make incremental decisions among the variables in $X$, propagating the decision in $\phi$, and simplifying the formula in order to cache some results [5]. Such approaches are exact, but their exhaustive nature limits their scalability. Component caching [23] is a practical way to improve scalability [5] and it could be beneficial to our algorithm too.

Knowledge compilation consists in compiling the formula 
into a representation over which solving the problem (here, 
the optimal model counting) is expected to be much easier. 
Compilation times tend to dominate and the memory usage of 
the compiled form may be huge. 

A possible approach could use a generalization of $X$-constrained SDDs [22]. The idea here would be to build $(X, Y)$-constrained SDDs, that is SDDs that are $X$-constrained, and for which each subtree that is not over $X$ is $Y$-constrained. In this case, one can easily compute the count of every possible $x \in M_X(\phi)$ and then propagate the maximum to the root of the tree. To the best of our knowledge, this direction has not been explored yet.


IX. CONCLUSION AND FUTURE WORK 


We proposed a CEGAR-based algorithm that solves medium-sized instances of the Max#SAT problem within reasonable time limits, as illustrated in our experiments. This algorithm either computes exact solutions (when possible), or can be smoothly relaxed to produce approximate results under well-defined probabilistic guarantees. Comparisons with an existing probabilistic tool showed the gains provided by our algorithm on concrete examples. Our implementation and all the related benchmarks are available at [7].

From an algorithmic point of view this work could be extended in several directions.

Table I
BENCHMARK LIST

Name                     | |X|  | |Y| | |Z|  | Nr. Clauses
backdoor-32-24*          | 32   | 32  | 83   | 76
backdoor-2x16-8*         | 32   | 32  | 136  | 272
pwd-backdoor             | 64   | 64  | 272  | 609
bin-search-16            | 16   | 16  | 1416 | 5825
CVE-2007-2875            | 32   | 32  | 720  | 1740
CVE-2009-3002            | 288  | 240 | 443  | 180
reverse                  | 32   | 32  | 165  | 293
ActivityService          | 70   | 34  | 4063 | 15257
ActivityService2         | 70   | 34  | 4063 | 15257
ConcreteActivityService  | 71   | 37  | 4728 | 17856
GuidanceService          | 69   | 27  | 3167 | 11612
GuidanceService2         | 69   | 27  | 3167 | 11612
IssueServiceImpl         | 77   | 29  | 3519 | 13024
IterationService         | 70   | 34  | 4063 | 15257
LoginService             | 92   | 27  | 5110 | 21559
NotificationServiceImpl2 | 87   | 32  | 5223 | 22006
PhaseService             | 70   | 34  | 4063 | 15257
ProcessBean              | 166  | 39  | 9675 | 41444
ProjectService           | 134  | 48  | 6778 | 24944
sign                     | 16   | 16  | 107  | 392
sign_correct             | 16   | 16  | 92   | 346
UserServiceImpl          | 87   | 31  | 3901 | 14653
drmx                     | 1030 | 17  | 26   | 2094
keller4                  | 43   | 15  | 62   | 2525
g2_n35e34_n58e61         | 34   | 7   | 954  | 38130

Table II
PERFORMANCE COMPARISONS BETWEEN BAXMC AND MAXCOUNT

Benchmark name           | BAXMC Time (s) | BAXMC Sym. Time (s) | BAXMC Model count (log) | MAXCOUNT Time (s) | MAXCOUNT Model count (log)
backdoor-32-24*          | 611.12         | 34.50               | 32                      | 231.87            | 32
backdoor-2x16-8*         | 60.02          | 61.07               | 16                      | 6512.28           | 16
pwd-backdoor             | 236.87         | 240.63              | 64                      | TO                | -
bin-search-16            | 1067.38        | 1048.43             | 16                      | 1490.44           | 16
CVE-2007-2875            | 36.14          | 37.39               | 32                      | TO                | -
CVE-2009-3002            | TO             | TO                  | -                       | MO                | -
reverse                  | TO             | TO                  | -                       | MO                | -
ActivityService          | 3060.39        | 3064.60             | 33.95                   | TO                | -
ActivityService2         | 3096.54        | 2999.72             | 33.95                   | TO                | -
ConcreteActivityService  | 84.20          | 84.44               | 36.91                   | TO                | -
GuidanceService          | 1468.39        | 1474.51             | 26.88                   | TO                | -
GuidanceService2         | 1459.74        | 1474.76             | 26.88                   | TO                | -
IssueServiceImpl         | 1603.21        | 1583.50             | 28.88                   | TO                | -
IterationService         | 3081.86        | 3068.95             | 33.95                   | TO                | -
LoginService             | 5275.25        | 5197.84             | 26.92                   | TO                | -
NotificationServiceImpl2 | 1286.48        | 1287.94             | 31.91                   | TO                | -
PhaseService             | 3071.86        | 3105.18             | 33.95                   | TO                | -
ProcessBean              | TO             | TO                  | -                       | TO                | -
ProjectService           | 5770.26        | 5544.02             | 47.92                   | TO                | -
sign                     | 73.56          | 73.43               | 15.90                   | 819.58            | 16
sign_correct             | 74.58          | 73.78               | 15.89                   | 819.56            | 16
UserServiceImpl          | TO             | TO                  | -                       | TO                | -
drmx                     | 24.39          | 24.07               | 16.99                   | TO                | -
keller4                  | TO             | TO                  | -                       | TO                | -
g2_n35e34_n58e61         | 0.17           | 0.41                | 2.53                    | TO                | -

Table III
PERFORMANCE COMPARISON BETWEEN HEURISTICS OF BAXMC

      | cache               | neg                 | pos                 | rnd
leads | 62968.38 - 3 - 7    | 65542.94 - 2 - 6    | 65222.40 - 1 - 4    | 67233.96 - 0 - 5
rnd   | 144081.73 - 0 - 19  | 139002.15 - 0 - 18  | 140755.04 - 0 - 17  | 137407.70 - 0 - 17
none  | 60368.28 - 3 - 5    | 62729.76 - 2 - 4    | 61317.54 - 3 - 4    | 56860.50 - 3 - 4
vsids | 69165.19 - 2 - 8    | 56189.26 - 1 - 5    | 54017.07 - 3 - 4    | 63865.03 - 2 - 6


First, we exploited some classes of symmetries when 
solving Max#SAT (Section V). This could be improved by 
detecting new kinds of symmetries [13], or exploiting them 
further using techniques such as symmetry propagation [24]. 

As discussed in Section IV, our relaxation algorithm (Al- 
gorithm 2) uses a linear sweep over the literals composing a 
witness. Instead of returning one possible minimal relaxation, 
MERGEXPLAIN [25] returns multiple ones, which may be 
helpful in our case by allowing the creation of multiple 
blocking clauses. 

As expected, in some instances, our algorithm may degen- 
erate into exhaustive search. While we do not know yet any 
characterization of all such instances, we believe that pre- 
processing and in-processing [26] techniques such as UNHID- 
ING [17] should improve performances and limit the set of 
inefficient instances. 

Finally, Algorithm 1 may be parallelized by correctly scheduling search spaces among threads, possibly using the leads described in Section VI-B. If we enforce the fact that all leads currently present in the lead list are disjoint, that is the $[x|_\ell]$ are pairwise disjoint (hence splitting the search space into parts), we expect a favorable parallelization setting.


Abstract—Isolators are a useful tool for reducing the computation needed to solve graph existence problems via SAT. We
extend techniques for creating isolators for undirected graphs to 
the tournament (complete, directed) case, noting several parallels 
in properties of isolators for the two classes. We further present 
an algorithm for constructing n-vertex tournament isolators with 
O(n log n) unit clauses. Finally, we show the utility of our new 
isolators in computations of tournament Ramsey numbers. 

Index Terms—Satisfiability, Symmetry-breaking, Directed- 
graphs, Tournaments, Isolators. 


I. INTRODUCTION 


In recent years, SAT solvers have been used to solve sev- 
eral difficult combinatorial problems [1]-[3]. However, naive 
encodings of SAT problems often include undesired symme- 
tries, i.e. certain matching subsets of variables that result in 
equivalent subproblems when given equivalent assignments. 
To prove the original formula unsatisfiable, in the worst case 
a solver must search through all possible symmetric parts of 
the problem space, which slows the generation of unsatisfiable 
proofs unnecessarily. Similarly, while the solver tries to find 
a satisfying assignment, symmetries in the input formula may 
cause the solver to effectively re-explore the same part of the 
search space even after proving the lack of a solution in a 
symmetric part of the problem. 

The most common way of reducing the impact of symmetries in a given formula is by adding a set of new clauses called a Symmetry-Breaking Predicate (SBP) to the formula before solving [4]-[6]. The goal of an SBP is to preserve the satisfiability of the formula while removing from consideration any regions of the search space known to be symmetric to other regions. In this work we focus on SBPs for graph existence problems, which are problems that can be solved by checking if a graph with a particular structure exists. Solving such problems is an active area of research [7]-[9]. A large class of problem symmetries in graph existence problems naturally results from the existence of isomorphic labeled graphs. These symmetries exist independent of any desired graph property related to graph structure. Rather, they occur because SAT solvers must search the space of labeled graphs in order to prove the (non-)existence of an unlabeled graph. An SBP that targets graph isomorphisms is known as an isolator. Isolators that break many symmetries with few clauses are most useful in practice, as SAT solvers generally take longer to solve formulas with more clauses. Such isolators are often described as “short”, “small”, or “compact.”

Prior work has shown that it is possible to generate small
isolators for undirected graphs [10]. The present work instead 
handles the generation of short isolators for tournaments: com- 
plete, directed graphs. There are several mathematically inter- 
esting questions one can ask about tournaments that motivate 
the generation of tournament isolators. For example, Sumner’s 
conjecture and various election models in social choice theory 
rely on tournament properties [11], [12]. Tournament isolators 
can also aid in the search for doubly-regular tournaments. 
Doubly-regular tournaments are a class of tournaments that 
(among many other properties) can be efficiently transformed 
to skew-symmetric Hadamard matrices [13], which have a 
wide array of practical uses. However, the most well-known 
question about tournament structure is the Tournament Ramsey 
number problem, an analog to Ramsey numbers [14] that 
asks the question of “in what size tournament n must a 
transitive subtournament of size k exist.” A (sub)tournament is 
transitive if it contains no cycles. Calculating the tournament 
Ramsey number for k = 7 is likely the limit of currently 
known techniques, and doing so would be impactful for the 
mathematical community. 

The first contribution of this work is the generation of compact tournament isolators that asymptotically match the search space reduction of a perfect isolator. Second, we present a methodology for the generation of compact isolators for small tournaments that extends prior work on undirected graphs [10]. Finally, we demonstrate the practical usage of our small isolators for finding larger graphs relevant to the search for tournament Ramsey numbers.


II. PRELIMINARIES 


We define the following common concepts from SAT literature: A literal is either a variable or a negated variable. We use ¬ to denote negation. A clause is a disjunction of literals. A unit clause (sometimes referred to as simply a unit) is a clause containing exactly one literal. A Conjunctive Normal Form (CNF) formula is a conjunction of clauses. Unless otherwise specified, “formula” refers to “CNF formula.” An assignment α is a function from variables to truth values (True/False). α satisfies a formula F if the Boolean function denoted by F returns True given the inputs specified by α.

We also define several graph-theoretic concepts. A tournament G = (V, E) is a complete directed graph; more formally, $\forall (v_1, v_2) \in V \times V,\ v_1 \neq v_2 \Rightarrow ((v_1, v_2) \in E) \oplus ((v_2, v_1) \in E)$ and $\forall v \in V,\ (v, v) \notin E$, where $\oplus$ is the XOR operation. The phrase “G is an n-vertex tournament” means |V| = n. Given an n-vertex tournament G = (V, E) and a permutation $\pi$ on V, $\pi(G)$ is defined as $\pi(G) = (V, \{(\pi(v_1), \pi(v_2)) \mid (v_1, v_2) \in E\})$ and is colloquially referred to as applying $\pi$ to G. Two n-vertex tournaments $G_1, G_2$ are isomorphic (written $G_1 \cong G_2$) exactly when there exists a permutation $\pi$ on the vertices of $G_1$ such that $\pi(G_1) = G_2$. When any such $\pi$ exists, it is referred to as an isomorphism between $G_1$ and $G_2$. The isomorphism class (also, equivalence class) $I_G$ of a tournament G is defined as $I_G = \{G' \mid G \cong G'\}$. An automorphism $\pi$ on a tournament G is any permutation $\pi$ such that $\pi(G) = G$. The set of automorphisms of G forms a group under function composition. This group is referred to as Aut(G).
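For small tournaments these definitions can be checked by brute force. The sketch below is illustrative only: vertices are numbered from 0, a tournament is a set of arcs (u, v), and the function names are hypothetical.

```python
from itertools import permutations
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Arc = Tuple[int, int]


def apply_permutation(perm: Sequence[int], arcs: Iterable[Arc]) -> FrozenSet[Arc]:
    """Return pi(G): every arc (u, v) becomes (pi(u), pi(v))."""
    return frozenset((perm[u], perm[v]) for u, v in arcs)


def automorphisms(n: int, arcs: Iterable[Arc]) -> List[Tuple[int, ...]]:
    """All permutations pi with pi(G) = G, found by exhaustive search."""
    target = frozenset(arcs)
    return [p for p in permutations(range(n)) if apply_permutation(p, target) == target]


def isomorphic(n: int, arcs_a: Iterable[Arc], arcs_b: Iterable[Arc]) -> bool:
    """True when some permutation maps the first tournament onto the second."""
    target = frozenset(arcs_b)
    source = frozenset(arcs_a)
    return any(apply_permutation(p, source) == target for p in permutations(range(n)))


cycle = {(0, 1), (1, 2), (2, 0)}
transitive = {(0, 1), (1, 2), (0, 2)}
cycle_automorphisms = automorphisms(3, cycle)            # identity and two rotations
transitive_automorphisms = automorphisms(3, transitive)  # identity only
```

The search is factorial in n, so it is only meant for the small cases discussed in the text.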


III. ISOLATOR NOTATION AND CONCEPTS 


To search for a tournament G satisfying some structural property, we define variables with the semantics “edge e exists in graph G” for use in a formula F. We say that F admits a graph G' exactly when there exists a satisfying assignment to the conjunction of F and the set of unit clauses semantically implied by the edges of G'. An isolator for n-vertex tournaments is a formula F that admits at least one tournament from each equivalence class on n-vertex tournaments. A perfect isolator is an isolator that admits exactly one tournament from each equivalence class. A perfect isolator F is optimal if there does not exist a perfect isolator with fewer non-unit clauses than F. A compact or short isolator is not rigorously defined. Rather, it describes an isolator with few enough non-unit clauses to be of practical use in solving SAT problems.

In this work, vertices will be denoted with lowercase letters a, b, c, ... or with $v_1, v_2, \dots, v_n$ when an ordering of the vertices is relevant. Arcs (directed edges) will be referred to with (u, v), meaning “there is an arc from u to v.” In our construction of isolators, each variable is written in the form uv and has the semantics “arc (u, v) exists in the graph.” Note that the literal ¬uv therefore means “arc (v, u) exists in the graph.”


A. Short Isolator Examples 


Consider the following two labeled 3-vertex tournaments.


[Figure: two labeled 3-vertex tournaments, a cycle and a transitive tournament, with the shared edges ab and bc colored red.]


These tournaments represent the only two equivalence classes for n = 3 tournaments: a cycle and a transitive tournament. While any combination of a cycle and transitive tournament would suffice to represent both equivalence classes, the tournaments chosen above have the interesting property of sharing two edges ab and bc (colored red). This property allows us to produce a short formula that admits both graphs:

ab ∧ bc.


This formula admits exactly one of the two labeled cycles and one of the six labeled transitive tournaments on 3 vertices, and does so with the fewest possible clauses. Therefore, ab ∧ bc is a perfect, optimal isolator for n = 3 tournaments.


Fig. 1. All isomorphism class representatives admitted by a perfect, optimal isolator for 4-vertex tournaments. Red edges are edges fixed by unit clauses of the isolator, and the isolator has only unit clauses.

Figure 1 displays canonical representatives of all 4 isomorphism classes for n = 4 tournaments. We note that once again all highlighted edges have the same edge labels across graphs, and all permutations of non-highlighted edges are present. So, a short formula that admits exactly the set of graphs in the figure is

ab ∧ bc ∧ cd ∧ ad.


While the optimal isolators for n = 3, 4 are comprised entirely of unit clauses, this pattern does not hold for n = 5. Table I contains the number of unit and non-unit clauses for our isolators on n < 8 vertices.
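The two unit-clause isolators above are small enough to check exhaustively. The following sketch (hypothetical function names, vertices a, b, c, d numbered 0 to 3) counts, for each isomorphism class, how many labeled tournaments satisfy a given set of unit clauses; a perfect isolator yields exactly one per class.

```python
from itertools import combinations, permutations, product
from typing import Dict, FrozenSet, Iterable, Iterator, Tuple

Arc = Tuple[int, int]


def all_tournaments(n: int) -> Iterator[FrozenSet[Arc]]:
    """Enumerate every labeled tournament on vertices 0..n-1."""
    pairs = list(combinations(range(n), 2))
    for orientation in product([True, False], repeat=len(pairs)):
        yield frozenset(
            (u, v) if forward else (v, u)
            for (u, v), forward in zip(pairs, orientation)
        )


def canonical(n: int, arcs: Iterable[Arc]) -> Tuple[Arc, ...]:
    """Smallest relabeling of the arc list, used as an isomorphism class key."""
    arcs = list(arcs)
    return min(
        tuple(sorted((p[u], p[v]) for u, v in arcs)) for p in permutations(range(n))
    )


def admitted_per_class(n: int, units: Iterable[Arc]) -> Dict[Tuple[Arc, ...], int]:
    """Map each isomorphism class to the number of admitted labeled tournaments."""
    units = set(units)
    counts: Dict[Tuple[Arc, ...], int] = {}
    for tournament in all_tournaments(n):
        key = canonical(n, tournament)
        counts.setdefault(key, 0)
        if units <= tournament:           # every unit clause is satisfied
            counts[key] += 1
    return counts


# ab and bc on 3 vertices; ab, bc, cd and ad on 4 vertices
n3 = admitted_per_class(3, {(0, 1), (1, 2)})
n4 = admitted_per_class(4, {(0, 1), (1, 2), (2, 3), (0, 3)})
assert sorted(n3.values()) == [1, 1]
assert sorted(n4.values()) == [1, 1, 1, 1]
```

The same function can be used to confirm that a formula is an isolator without being perfect, by checking that every count is at least 1.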


B. Comparison of undirected graph and tournament isolators 


Although the majority of this work focuses on tournament isolators, there are many interesting parallels between undirected and tournament isolators. In an undirected context, the existence of edge (u, v) is denoted by the literal uv, while its nonexistence is denoted by the literal’s negation ¬uv. Because edgeless and complete graphs are isomorphism classes for any n in the undirected case, every clause of an undirected graph isolator containing only arc literals must contain at least one positive and one negative literal. These two graphs do not exist
in the case of tournaments; the closest parallel is transitive 
tournaments. Unlike the set of n-vertex undirected graphs 
which contains exactly one empty graph and one complete 
graph with n! automorphisms each, there are n! isomorphic 
transitive tournaments on n vertices. It is possible to select 
the particular transitive tournament TT that an isolator admits 
by ensuring that at least one edge from TT is present in each 
clause of the isolator. A simple way to do so is ensure each 
clause contains at least one edge uv s.t. u < v in vertex 
numbering. 

One consequence of undirected isolators requiring at least 
one positive and one negative literal per clause is that undi- 
rected isolators have no unit clauses. However, while negating 
all literals in an undirected isolator produces another undi- 
rected isolator (because the existence and non-existence of an 
edge is symmetric), there is no direct parallel to be found in 
tournaments as edge directionality does not have this property. 

Another interesting difference between undirected graphs and tournaments is the low number of isomorphism classes for tournaments when n is small (see Table I). Intuitively, this happens because it is “easier” for tournaments to be isomorphic. The two options for the edge between vertices u and v in the undirected case are uv existing or not existing. Crucially, an undirected graph G will never be isomorphic to G' constructed by adding or removing an edge of G, which is an operation that can be seen as “flipping” an edge to its other possibility. However, “flipping” an edge of a tournament T by changing the edge’s direction will produce T' ≅ T iff the two vertices u and v of the flipped edge had the same edges to the rest of the graph (the isomorphism is via the permutation that swaps u and v). Although this discrepancy exists for small n, the numbers of isomorphism classes for undirected graphs and tournaments are remarkably close for larger n (see OEIS A000088, A000568 [15]). Therefore, we expect that perfect, optimal isolators for undirected graphs and tournaments will have similar numbers of clauses for larger n.


C. Arc Literal Numbering 


Each uv must be assigned a corresponding integer to conform to the commonly used DIMACS CNF format. To do so, we specify a function $idx_n(u, v)$ to map each possible arc (u, v) in an n-vertex tournament to a unique integer identifier. Because exactly one of (u, v) and (v, u) must exist for any two vertices u, v, idx must satisfy $idx_n(u, v) = -idx_n(v, u)$. To facilitate isolator comparisons across different n, idx also should satisfy the property that $idx_n(u, v) = idx_{n+1}(u, v)$. We therefore drop the subscript n when referring to $idx_n(u, v)$ in the future, as its value does not depend on n.

In particular, idx is inductively defined as follows for an (n+1)-vertex tournament with vertices $v_1, v_2, \dots, v_n, v_{n+1}$. Let $K = n(n-1)/2$ be the largest output of idx for an n-vertex tournament (implying base case $idx(v_1, v_2) = 1$ when n = 2). Applying idx to each of the arcs $(v_1, v_{n+1}), (v_2, v_{n+1}), \dots, (v_n, v_{n+1})$ yields $K+1, K+2, \dots, K+n$, respectively. All arcs not included in this definition are of the form $(v_w, v_u)$ where $w > u$, and are defined by the earlier mentioned constraint of $idx(u, v) = -idx(v, u)$.
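The inductive definition unrolls into a closed form: for u < v, the arc $(v_u, v_v)$ receives the identifier $(v-1)(v-2)/2 + u$. A minimal sketch, assuming 1-based vertex indices passed as plain integers:

```python
def idx(u: int, v: int) -> int:
    """DIMACS identifier of arc (v_u, v_v), with 1-based vertex indices."""
    if u == v:
        raise ValueError("a tournament has no self-loops")
    if u > v:
        return -idx(v, u)                 # idx(u, v) = -idx(v, u)
    largest_before = (v - 1) * (v - 2) // 2   # K for the (v-1)-vertex tournament
    return largest_before + u


# Arcs into v_4 continue after K = 3, the largest identifier for n = 3.
arcs_into_v4 = [idx(1, 4), idx(2, 4), idx(3, 4)]   # 4, 5, 6
```

Because the value depends only on u and v, identifiers computed for an n-vertex tournament stay valid for any larger tournament.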


IV. UNIT CLAUSES 


In practice, SAT solvers immediately reduce formulas with 
unit clauses to shorter formulas without units via unit prop- 
agation. Additionally, each unit clause reduces the size of 
the search space by a factor of 2. Therefore, it is practically 
useful to create isolators with as many units as possible. The 
following sections detail and analyze our various methods for 
creating isolators with many unit clauses. 


A. Provable Units 


While constructing smaller isolators using the techniques above, we opted to manually inspect our results and see what patterns they shared. In doing so, we rediscovered a well-known fact from graph theory literature: every tournament contains a Hamiltonian path [16]. Proof sketch: inductively consider a length n Hamiltonian path $v_1, v_2, \dots, v_n$ in an (n+1)-vertex tournament G = (V, E). For the vertex $v_{n+1}$ not part of the path, in the case that either $(v_{n+1}, v_1)$ or $(v_n, v_{n+1})$ is in E, a length n + 1 Hamiltonian path is formed. Otherwise, $(v_1, v_{n+1})$ and $(v_{n+1}, v_n)$ are in E and thus there must exist consecutive vertices $v_i, v_{i+1}$ in the Hamiltonian path such that arcs $(v_i, v_{n+1})$ and $(v_{n+1}, v_{i+1})$ are in E. In this case, the sequence $v_1, v_2, \dots, v_i, v_{n+1}, v_{i+1}, \dots, v_n$ forms a length n + 1 Hamiltonian path. As a result of this property, a set of unit clauses describing a Hamiltonian path on an n-vertex tournament is always a valid n-vertex isolator.
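The proof sketch is constructive and translates directly into an insertion procedure. In the sketch below the function name and the `beats(u, v)` callback (true when arc (u, v) exists) are assumptions made for illustration; vertices are numbered from 0.

```python
from typing import Callable, List


def hamiltonian_path(n: int, beats: Callable[[int, int], bool]) -> List[int]:
    """Build a Hamiltonian path of an n-vertex tournament by insertion."""
    path: List[int] = []
    for new in range(n):
        if not path or beats(new, path[0]):
            path.insert(0, new)           # new vertex points to the path start
        elif beats(path[-1], new):
            path.append(new)              # path end points to the new vertex
        else:
            # Here path[0] -> new and new -> path[-1], so some consecutive
            # pair satisfies path[i] -> new -> path[i + 1].
            for i in range(len(path) - 1):
                if beats(path[i], new) and beats(new, path[i + 1]):
                    path.insert(i + 1, new)
                    break
    return path


arcs = {(0, 1), (0, 2), (2, 1)}
path = hamiltonian_path(3, lambda u, v: (u, v) in arcs)   # [0, 2, 1]
```

Each insertion scans the current path once, so the procedure uses a quadratic number of arc queries overall.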

Given the utility of unit clauses in isolators, it is natural to ask how many units there can possibly be in an n-vertex isolator. As it turns out, there is a long-known result from graph theory that implies that asymptotically there are at most O(n log n) units possible. By the orbit-stabilizer theorem, the size of the equivalence class of a graph G on n vertices is $\frac{n!}{|Aut(G)|}$, where Aut(G) is the set of distinct automorphisms of G. In 1963 Erdős and Rényi proved that as n approaches infinity, the proportion of undirected graphs of size n with nontrivial automorphisms approaches 0 [17]. The same result for tournaments directly follows. Therefore, a proportion of tournaments approaching 1 has equivalence classes of size n!, so the asymptotic number of equivalence classes is


$$\frac{2^{\binom{n}{2}}}{n!} = \frac{2^{\binom{n}{2}}}{2^{\Theta(n \log n)}} = 2^{\binom{n}{2} - \Theta(n \log n)}.$$


An isolator with k unit clauses for n-vertex graphs admits at most $2^{\binom{n}{2} - k}$ equivalence class representatives, so in order to admit at least one member of each equivalence class (by the definition of an isolator), the number of units in an isolator must also be asymptotically upper-bounded by n log n.
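The same counting argument gives a concrete ceiling for any fixed n. Every equivalence class has at most n! members, so there are at least $2^{\binom{n}{2}}/n!$ classes, and an isolator with k units needs $2^{\binom{n}{2}-k} \ge 2^{\binom{n}{2}}/n!$, that is $k \le \log_2(n!)$. The helper below (a hypothetical name) evaluates this ceiling exactly with integer arithmetic.

```python
from math import factorial


def max_units_ceiling(n: int) -> int:
    """floor(log2(n!)): no n-vertex isolator can have more unit clauses."""
    return factorial(n).bit_length() - 1


ceilings = {n: max_units_ceiling(n) for n in (3, 4, 16)}
```

For n = 3 and n = 4 the ceiling is 2 and 4, which the all-unit isolators of Section III-A reach.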

In the next section, we provide a procedure that achieves 
this bound. 


B. TT-fixing 


In situations where we know that every member of the class of n-vertex tournaments contains a $TT_k$ (a transitive tournament of size k), we also know that every equivalence class must contain a member with the tournament fixed in some arbitrary position and orientation (i.e. vertices 1 through k in ascending order). Therefore, any formula that fixes (i.e. asserts the existence of) a $TT_k$ on the class of n-vertex tournaments is a valid isolator. Because the remaining subset of n - k non-fixed vertices also forms a tournament, further knowledge about the existence of a transitive tournament within the remaining n - k vertices can be used to fix (via units) another transitive tournament within the (n - k)-vertex subtournament. This procedure can be repeated until all vertices of the original tournament are part of some fixed transitive subtournament. Tournament Ramsey numbers provide exactly the required information about the existence of a transitive subtournament. In fact, tournament Ramsey numbers R(k) (when known) provide the largest $TT_k$ guaranteed to exist in a tournament of size at least R(k). Therefore, tournament Ramsey numbers (as well as upper bounds, which exist for arbitrarily large n) can be used to iteratively construct large sets of unit clauses for tournament isolators: we will refer to this process as TT-fixing.

TT-fixing is best understood via a small example like Figure
2. For an arbitrary 16-vertex tournament G, R(5) = 14 implies
that G must contain a $TT_5$ as a subtournament. Therefore, the
arcs between vertices $v_1 \ldots v_5$ can be "fixed" to be a transitive
tournament by generating all unit clauses corresponding to
those arcs. However, the remaining $16 - 5 = 11$ vertices of
G also form an arbitrary subtournament G' on 11 vertices.
Because R(4) = 8, a $TT_4$ is guaranteed to exist in G', so we
can add unit clauses corresponding to the specific location of
that $TT_4$ in vertices $v_6 \ldots v_9$. The repetition of this
procedure down to 1 or 0 remaining vertices is TT-fixing.

Fig. 2. A visual depiction of the unit clauses provided by TT-fixing in the
adjacency matrix of an n = 16 tournament. For entries with value 1 at row i
and column j, the unit clause corresponding to $(v_i, v_j)$ is added by TT-fixing.
Equivalently, the 1's and 0's shown will exist in the adjacency matrix of any
graph admitted by a TT-fixing isolator.

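
The iterative procedure can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, the table holds only the exactly known values R(2) through R(6) plus the trivial R(1) = 1, and a unit clause is represented simply as an arc (i, j) meaning "$v_i$ points to $v_j$".

```python
# Known tournament Ramsey numbers R(k); R(1) = 1 is the trivial single-vertex case.
RAMSEY = {1: 1, 2: 2, 3: 4, 4: 8, 5: 14, 6: 28}


def tt_fixing_blocks(n):
    """Split vertices 1..n into consecutive blocks, each fixed as a transitive tournament."""
    blocks, start = [], 1
    while start <= n:
        remaining = n - start + 1
        # Largest k whose Ramsey number guarantees a TT_k among the remaining vertices.
        k = max(size for size, r in RAMSEY.items() if r <= remaining)
        blocks.append(list(range(start, start + k)))
        start += k
    return blocks


def tt_fixing_units(n):
    """Unit clauses as arcs (i, j): within each block, lower vertices point to higher ones."""
    return [(u, v)
            for block in tt_fixing_blocks(n)
            for a, u in enumerate(block)
            for v in block[a + 1:]]


print(tt_fixing_blocks(16))
print(len(tt_fixing_units(16)))
```

For n = 16 the first two blocks are $v_1 \ldots v_5$ and $v_6 \ldots v_9$, as in the example above. Because the table stops at R(6), this sketch never fixes a block larger than 6 vertices.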
C. TT-fixing gives O(nlogn) units 

Let units(n) be the function that returns the number of
units that can be added to an isolator when using the TT-fixing
method on n-vertex tournaments. Our goal is to prove
a lower bound on units(n). Unfortunately, exact tournament
Ramsey numbers are non-trivial to calculate (only up to
R(6) = 28 is known). However, from Erdős and Moser we
have that $R(k) \le 2^{k-1}$ [18], i.e. that a $TT_k$ must exist when
considering any tournament on $2^{k-1}$ or more vertices. Erdős
and Moser's bound can thus be used with TT-fixing to lower-bound
units(n).


We claim that $\mathit{units}(n) \ge \sum_{i=1}^{n} \lfloor \log_2(i) \rfloor / 2$. We proceed via
induction, with step n depending on step $n - k$, with
$k = \lfloor \log_2(n) \rfloor + 1$. The proposition is true for n = 1 because
$0 \ge 0$, and for n = 2 because $1 \ge 0.5$. By definition of TT-fixing, for
a graph with n vertices we have

$$\mathit{units}(n) = \frac{k(k-1)}{2} + \mathit{units}(n-k). \tag{1}$$

By the inductive hypothesis,

$$\mathit{units}(n-k) \ge \sum_{i=1}^{n-k} \lfloor \log_2(i) \rfloor / 2. \tag{2}$$

Next, we have that

$$\frac{k(k-1)}{2} = k \lfloor \log_2(n) \rfloor / 2 = \sum_{i=n-k+1}^{n} \lfloor \log_2(n) \rfloor / 2 \ge \sum_{i=n-k+1}^{n} \lfloor \log_2(i) \rfloor / 2. \tag{3}$$

Combining lower bounds (2) and (3) for the terms of eq. (1)
completes the proof:

$$\mathit{units}(n) = \frac{k(k-1)}{2} + \mathit{units}(n-k) \ge \sum_{i=n-k+1}^{n} \lfloor \log_2(i) \rfloor / 2 + \sum_{i=1}^{n-k} \lfloor \log_2(i) \rfloor / 2 \tag{4}$$

$$= \sum_{i=1}^{n} \lfloor \log_2(i) \rfloor / 2. \tag{5}$$

Fig. 3. A visual depiction of how the number of unit clauses produced by
TT-fixing grows under different assumptions about Ramsey numbers.


This inequality directly implies the asymptotic n log n
bound, because $\log_2(n!) \in \Theta(n \log n)$.
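
The inequality can be sanity-checked numerically. The sketch below is an illustration under the same assumption as the proof, namely that each step of TT-fixing uses only the Erdős and Moser bound, so a block of $k = \lfloor \log_2(n) \rfloor + 1$ vertices is fixed when n vertices remain. The function names are hypothetical.

```python
def units_erdos_moser(n):
    """Units produced by TT-fixing when only R(k) <= 2^(k-1) is used, as in eq. (1)."""
    total = 0
    while n > 0:
        k = n.bit_length()            # equals floor(log2(n)) + 1 for n >= 1
        total += k * (k - 1) // 2     # arcs inside the fixed transitive block
        n -= k
    return total


def claimed_lower_bound(n):
    """Right-hand side of the claim: sum of floor(log2(i)) / 2 for i = 1..n."""
    return sum(i.bit_length() - 1 for i in range(1, n + 1)) / 2


for n in (1, 2, 8, 16, 100):
    print(n, units_erdos_moser(n), claimed_lower_bound(n))
```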


D. Practical vs Theoretical TT-fixing units 


We first note a useful recurrence relation on tournament
Ramsey numbers: $R(k) \le 2R(k-1)$.


Proof. Consider an arbitrary vertex v in an arbitrary tournament
G on $2R(k-1)$ vertices. v must have either an out-degree
or an in-degree of at least $R(k-1)$. In either case,
consider the subset of at least $R(k-1)$ vertices pointed to/at
by v. This subset must contain some $TT_{k-1}$ as a subgraph by
definition of $R(k-1)$. However, v points to or at all vertices
in this $TT_{k-1}$, which demonstrates that a $TT_k$ comprised of
the $TT_{k-1}$ vertices and v exists in G.
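
The same halving argument, applied repeatedly, gives a simple greedy procedure for locating a transitive subtournament. The following sketch is illustrative and not part of the paper's tooling: `random_tournament`, `find_transitive` and `is_transitive_order` are hypothetical helper names, and a tournament is represented as a dictionary mapping each vertex to the set of vertices it points to.

```python
import random


def random_tournament(n, seed=0):
    """beats[u] is the set of vertices that u points to."""
    rng = random.Random(seed)
    beats = {u: set() for u in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < 0.5:
                beats[u].add(v)
            else:
                beats[v].add(u)
    return beats


def find_transitive(beats, vertices):
    """Pick a vertex, keep the larger of its out/in neighbourhoods, and recurse."""
    vertices = list(vertices)
    if not vertices:
        return []
    v, rest = vertices[0], vertices[1:]
    out_nbrs = [u for u in rest if u in beats[v]]
    in_nbrs = [u for u in rest if u not in beats[v]]
    if len(out_nbrs) >= len(in_nbrs):
        return [v] + find_transitive(beats, out_nbrs)   # v points to everything kept
    return find_transitive(beats, in_nbrs) + [v]        # everything kept points to v


def is_transitive_order(beats, order):
    """True if every vertex in the order points to all later vertices."""
    return all(order[j] in beats[order[i]]
               for i in range(len(order)) for j in range(i + 1, len(order)))


t = random_tournament(16, seed=1)
order = find_transitive(t, range(16))
print(order, is_transitive_order(t, order))
```

On 16 vertices the kept neighbourhood has at least 8, then 4, 2 and 1 vertices, so the returned order always has length at least 5.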


In Figure 3 the bottom two lines depict the strict lower
bound used in the n log n units proof (blue), as well as the
actual number of units TT-fixing would provide if we only
used the $R(k) \le 2^{k-1}$ bound from the proof (red). Above
that (orange) is the number of units TT-fixing provides given
the best currently known Ramsey number bounds. The best
known bound on R(7) is $34 \le R(7) \le 47$ [19], so the black
line describes the best case for how many unit clauses TT-fixing
could provide if R(7) = 34 were proven. The recurrence
relation $R(k) \le 2R(k-1)$ is what allows even improvements
to small Ramsey number bounds to impact the efficacy of TT-fixing
for large n.

Fig. 4. A depiction of the search space reduction provided by TT-fixing using
best-known Ramsey number bounds up to n = 18 in $\log_2$ space.

In Figure 4, the top (orange) line is the total number of 
graphs a SAT solver must search in a tournament existence 
problem in the absence of an isolator. The bottom (red) line 
is the number of unlabeled tournaments on n vertices; this 
is the minimum number of graphs that any brute-force solver 
must search to solve a tournament existence problem. This 
data was taken from OEIS sequence A000568 [15], which 
limits the size of n for which we can make this comparison to 
n = 19. The middle (blue) line shows how many graphs are 
admitted by a TT-fixing isolator using the best known bounds 
on tournament Ramsey numbers. As n grows large, the gap 
between the bottom two lines should grow small as per the 
n log n units upper bound proof.


E. Undirected Isolators: Clique-fixing 


As mentioned earlier, undirected isolators cannot have unit
clauses. Therefore, undirected isolators cannot directly benefit
from units via TT-fixing. However, a crossover result for
undirected graphs does exist for binary clauses that uses the
same ideas as TT-fixing; we term this process clique-fixing.
Undirected Ramsey numbers guarantee the existence of a red
or blue colored k-clique for graphs with more than $R_u(k)$
vertices ($R_u$ used here for undirected Ramsey numbers).
Clique-fixing uses the same iterative process as TT-fixing, but
generates the following clauses instead of $TT_k$ units:

$$\{\, r \lor e,\ \lnot r \lor \lnot e \mid e \in \mathit{Edges}(K_k) \,\}$$

where r is an auxiliary variable representing the concept
"the k-clique is red" and $\mathit{Edges}(K_k)$ is the set of edge literals
for the complete graph on k vertices. We note that these
clauses are "almost" units in the sense that after a solver
makes a decision about whether to set r to true or false,
$\binom{k}{2}$ edges are set by unit propagation. Therefore, clique-fixing
steps reduce the search space by half as much as
TT-fixing steps do. Although not the focus of this work,
it is plausible that a similar asymptotic optimality analysis
could be done for clique-fixing given this small discrepancy.
However, undirected Ramsey numbers (necessary for clique-fixing)
empirically grow much faster than tournament Ramsey
numbers (and also theoretically: $R_u(k) \le 4R_u(k-1)$), so
clique-fixing may not be as practically useful as TT-fixing.
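
The "almost unit" behaviour is easy to see on a small instance. The sketch below is a hedged illustration only: literals are signed integers in the usual DIMACS style, the variable numbering and the helper names `clique_fixing_clauses` and `forced_by` are assumptions made for this example, and the polarity follows the clause set as written above.

```python
from itertools import combinations


def clique_fixing_clauses(vertices, edge_var, r):
    """Binary clauses (r or e) and (not r or not e) for every edge e among `vertices`."""
    clauses = []
    for u, v in combinations(vertices, 2):
        e = edge_var[(u, v)]
        clauses += [[r, e], [-r, -e]]
    return clauses


def forced_by(clauses, literal):
    """Literals implied by one round of unit propagation after assuming `literal`."""
    forced = set()
    for clause in clauses:
        if -literal in clause:
            forced.update(x for x in clause if x != -literal)
    return forced


vertices = [1, 2, 3, 4]
edge_var = {pair: i + 1 for i, pair in enumerate(combinations(vertices, 2))}  # variables 1..6
r = 7                                                                         # auxiliary variable
clauses = clique_fixing_clauses(vertices, edge_var, r)
print(sorted(forced_by(clauses, r)))    # deciding r fixes all six edges one way
print(sorted(forced_by(clauses, -r)))   # deciding not-r fixes them the other way
```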


V. PERFECT, OPTIMAL ISOLATOR SAT ENCODING 


Unit-based techniques scale to arbitrary n, and TT-fixing 
is “asymptotically perfect” in the sense that for large tour- 
naments, no isolator generation technique can provide more 
than a non-constant factor of search space reduction over TT- 
fixing. However, no known perfect isolators for n > 4 consist 
solely of unit clauses. Additionally, it can be practically useful 
to have an optimal perfect isolator for small tournaments 
to allow searching via SAT solver for only non-isomorphic 
(sub-)graphs as efficiently as possible. The practical utility of 
compact perfect isolators is demonstrated in our own exper- 
iments in the later “Tournament Ramsey Graphs” section. In 
the following sections, we describe our technique for creating
perfect, optimal isolators for $n \le 6$.


A. Basic SAT encoding 


We re-implemented and modified the perfect isolator encoding
for undirected graphs [10] to be used for tournaments.
Formally, we encoded the question "Is there a set of k
clauses $C_1, C_2, \ldots, C_k$ that is a perfect isolator for n-vertex
tournaments?" Decoding a solution to this formula allowed us
to produce an n-vertex isolator with k clauses.

For the ith isolator clause $C_i$ and arc literal l, we defined the
variable $\mathit{In}(C_i, l)$ to represent "l is in $C_i$". Then, for each
tournament G on n vertices, we define variables $\mathit{Excludes}(G, C_i)$
for $1 \le i \le k$ to mean "clause $C_i$ does not admit G."
This specification is implemented as follows with a Tseitin
encoding [20] to handle the equality and conjunctions:

$$\mathit{Excludes}(G, C_i) \leftrightarrow \bigwedge_{l \in A_G} \lnot \mathit{In}(C_i, l). \tag{6}$$

Here $A_G$ is the set of arc literals corresponding to the arcs
present in graph G. We also define the variable Canon(G)
for all graphs G, meaning "Graph G is the canonical representative
of its isomorphism class $I_G$." We implement this as
follows (again using Tseitin):

$$\mathit{Canon}(G) \leftrightarrow \bigwedge_{i=1}^{k} \lnot \mathit{Excludes}(G, C_i). \tag{7}$$
Finally, for each isomorphism class I, we add the following
clauses representing "exactly one graph in I is canonical" to
the formula:

$$\mathit{ExactlyOne}(\{\mathit{Canon}(G) \mid G \in I\}) \tag{8}$$


Here ExactlyOne is implemented with an At Most One 
operation via Sinz encoding [21] and an At Least One via 
disjunction. Therefore, a satisfying assignment to this formula 
corresponds to a perfect isolator on k clauses. If the formula 
is unsatisfiable for k and satisfiable for k + 1, then the perfect 
isolator with k + 1 clauses is optimal for the n in question. 
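
The property that the encoding searches for can be checked by brute force on very small n. The sketch below is not the SAT encoding itself; it is an illustrative checker under stated assumptions: vertices are numbered from 0, a literal is a pair (arc, polarity) where polarity True means the lower-numbered vertex points to the higher one, and isomorphism classes are found by trying every vertex permutation. All function names are hypothetical.

```python
from itertools import combinations, permutations, product


def arcs(n):
    return list(combinations(range(n), 2))


def class_label(n, orient):
    """Smallest relabeled copy of the tournament, used as its isomorphism class label."""
    best = None
    for p in permutations(range(n)):
        image = {}
        for (i, j), forward in orient.items():
            a, b = (p[i], p[j]) if forward else (p[j], p[i])   # arc a -> b after relabeling
            image[(min(a, b), max(a, b))] = a < b
        key = tuple(image[e] for e in arcs(n))
        if best is None or key < best:
            best = key
    return best


def admits(clauses, orient):
    """A tournament is admitted when every clause has at least one true literal."""
    return all(any(orient[arc] == pol for arc, pol in clause) for clause in clauses)


def is_perfect_isolator(n, clauses):
    """True if exactly one tournament of every isomorphism class is admitted."""
    admitted_per_class = {}
    for bits in product([True, False], repeat=len(arcs(n))):
        orient = dict(zip(arcs(n), bits))
        label = class_label(n, orient)
        admitted_per_class[label] = admitted_per_class.get(label, 0) + admits(clauses, orient)
    return all(count == 1 for count in admitted_per_class.values())


two_units = [[((0, 1), True)], [((1, 2), True)]]
print(is_perfect_isolator(3, two_units))   # two unit clauses suffice for n = 3
print(is_perfect_isolator(3, []))          # the empty clause set admits duplicates
```

This exhaustive check grows far too quickly to replace the SAT formulation; it is only meant to make the "exactly one canonical graph per class" condition concrete.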


B. Symmetry Breaking 


One symmetry in the above encoding is the order of the
isolator clauses, as reordering clauses of an expression in
CNF does not affect its satisfying assignments. To break
this symmetry, we added clauses that ensured a lexicographic
ordering of the clauses in the resulting isolator. For every
adjacent pair of clauses $C_i$ and $C_{i+1}$, we fixed some ordering
$l_1, l_2, \ldots, l_n$ of every literal that may appear in them, and
then created variables $e_0, e_1, \ldots, e_n$ where $e_j$ represents that
clauses $C_i$ and $C_{i+1}$ are equivalent when considering only the
first j literals. $e_0$ is always true, and to maintain the semantics
of the other $e_j$ we added the clauses

$$e_j \leftrightarrow \left(e_{j-1} \land \left(\mathit{In}(C_i, l_j) \leftrightarrow \mathit{In}(C_{i+1}, l_j)\right)\right)$$

via the Tseitin transformation for every $1 \le i < k$ and $1 \le j \le n$.
Then, we enforced a lexicographic ordering by requiring
that for every j such that $C_i$ and $C_{i+1}$ were equal on the first $j - 1$ literals,
if clause $C_i$ contained $l_j$ then $C_{i+1}$ must also contain $l_j$.
Explicitly, we added the following requirement via the Tseitin
transformation for every $1 \le i < k$ and $1 \le j \le n$:

$$e_{j-1} \land \mathit{In}(C_i, l_j) \Rightarrow \mathit{In}(C_{i+1}, l_j)$$

and furthermore we required that $e_n$ is false to ensure a strict
ordering. When searching for an isolator with k clauses, this
reduces the search space by a factor of k! as only one of
the k! permutations of a given distinct set of clauses will be
considered.
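
The effect of the ordering constraints can be illustrated outside the SAT solver. In the sketch below, each clause is a row of 0/1 membership flags playing the role of $\mathit{In}(C_i, l_j)$, and the function replays the $e_j$ chain for each adjacent pair. The function name and the example rows are assumptions made for illustration.

```python
from itertools import permutations


def strictly_lex_ordered(rows):
    """rows[i][j] plays the role of In(C_i, l_j); checks every adjacent pair of clauses."""
    for c, d in zip(rows, rows[1:]):
        equal_so_far = True                                   # e_0 is always true
        for in_c, in_d in zip(c, d):
            if equal_so_far and in_c and not in_d:
                return False                                  # e_{j-1} and In(C_i, l_j) but not In(C_{i+1}, l_j)
            equal_so_far = equal_so_far and (in_c == in_d)    # e_j
        if equal_so_far:
            return False                                      # e_n must be false: no duplicate clauses
    return True


rows = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]
survivors = [p for p in permutations(rows) if strictly_lex_ordered(p)]
print(survivors)
```

Of the 3! = 6 orderings of three distinct clauses, only the ascending one passes, which is the k! reduction described above.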

There is another symmetry in the vertex labeling. For a
given isolator, for each literal l corresponding to arc index $I_{a,b}$,
we can change l to correspond to arc index $I_{\pi(a),\pi(b)}$ where $\pi$
is a permutation of vertex labels. The resulting isolator accepts
the same graphs that the original did, but under vertex permutation
$\pi$. To break this symmetry, note that any tournament
isolator must admit exactly one transitive tournament. So, we
choose to admit only the canonical transitive graph with edges
of the form $(v_i, v_j)$, $i < j$. Note that because every edge in
this graph goes from a lower numbered vertex to a higher
numbered vertex, the corresponding literals in our encoding
are all positive. As such, we know that for any isolator, there
is a permuted isolator in which every clause has at least one
positive literal. We may add this to our encoding
by requiring for all clauses C

$$\bigvee_{l \in A_p} \mathit{In}(C, l)$$

with $A_p$ being the set of all positive literals. When trying to
find an isolator for n vertices, this reduces the search space by
a factor of n! since the solver is guaranteed to only consider
isolators for which the canonical transitive graph is the one
described above.


C. Encoding Unit Propagation 


Under the encoding described above, our solver finds iso- 
lators with many large clauses. However, by applying unit 
propagation it was often possible to reduce clause sizes. This 
indicated that not only was the solver generating solutions 
that needed postprocessing, but candidate isolators that were 
equivalent under unit propagation were being considered multiple
times, a sort of symmetry in this problem. To resolve
this, we added variables Unit(l) representing "literal l is a
unit clause." We then required that the isolator be already
unit-propagated with respect to these literals by adding the
requirement

$$\lnot \mathit{In}(C, l) \lor \lnot \mathit{Unit}(l)$$

for all clauses C and literals l. We also had to account for
these units excluding graphs in the Canon clauses, which were
updated to

$$\mathit{Canon}(G) \leftrightarrow \bigwedge_{i=1}^{k} \lnot \mathit{Excludes}(G, C_i) \land \bigwedge_{l \in A_G} \lnot \mathit{Unit}(\lnot l)$$


Finally, we considered whether to count these special unit 
literals towards the clause count in determining isolator op- 
timality. As mentioned in the preliminaries, we chose not to 
do so. When an isolator with units is used in a SAT solver, 
the units will be instantly eliminated through unit propagation 
and thus will reduce the complexity of the resulting problem. 
Therefore, we consider an optimal isolator to not just have 
the minimal number of clauses, but the minimal number of 
non-unit clauses. Since units cannot exist in undirected graph 
isolators (because an undirected graph isolator must admit both 
the complete and empty graph), this definition of optimality 
is consistent with the prior work on the undirected case [10]. 
Note that we only needed to consider positive unit literals as 
per the vertex-labeling symmetry breaking, which drastically 
reduced the search space. 


VI. ADDITIONAL ISOLATOR GENERATION TECHNIQUES 


The following sections describe several miscellaneous tech- 
niques, ranging from practical ways to gain slight improve- 
ments on prior techniques to possible directions for future 
research. 


A. Incremental Isolators 


Prior work has already shown that any isolator for n-vertex
tournaments is also an isolator for (n+k)-vertex tournaments for
any positive k, and that combining an isolator on m vertices
with an isolator on n vertices by applying each isolator to a
disjoint subset of vertices creates a new isolator on m+n
vertices [22]. Therefore, it is possible to construct perfect
isolators for (n+k)-vertex tournaments by adding clauses to
any isolator for n-vertex tournaments. In particular, our SAT
encoding pipeline had the option to ignore graphs that are not 
admitted by a given set of units. Including the maximal set of 
units from an n-vertex isolator when generating an encoding 
for (n+1)-vertex isolators reduces the number of tournaments
to generate Canon clauses for by a factor of at least $2^n$
because each isolator has at least the units corresponding 
to a Hamiltonian path. It is worth noting that we do not 
have any proofs that any of our non-perfect or non-optimal 
isolators can be extended to an optimal isolator, even when 
the isolator being extended from is comprised of only unit 
clauses. However, extending an isolator from an initial set of 
units can make searching for compact isolators much more 
efficient. 

The technique of combining isolators is useful for creating 
compact isolators for large n. Although TT-fixing guarantees 
asymptotic optimality, it does not always add the optimal 
number of units for small n. For example, TT-fixing will 
generate 9 units when processing 8-vertex (sub)tournaments, 
while an isolator for n = 8 with 11 units is possible. 


B. Probing 


In addition to the SAT encoding approach to isolator gen- 
eration, we also generated isolators using a method from 
prior work called “random probes” [10]. On a high level, 
this approach starts with an empty set of clauses and adds 
randomly generated clauses that preserve at least one member 
of each equivalence class until the isolator is perfect. There 
were only two non-superficial changes needed to adapt the 
prior work on random probes for undirected graphs to the 
directed case: allowing unit clauses and allowing clauses with
only positive literals. While not guaranteed to generate optimal 
isolators, the strength of this approach is the relative speed 
with which isolators are generated. This approach also bene- 
fited in efficiency from the technique of disallowing clauses 
with all negative literals and extending isolators from the unit 
clauses of smaller isolators. 


VII. RESULTS 


Our experimental results include the sizes of known perfect 
isolators for small n, as well as experiments showing the 
practical utility of small n = 6,7 perfect isolators for solving 
a tournament existence problem. All results and code are 
available at https://github.com/evanlohn/digraph_isolators. 


A. Experimental Setup 


Our SAT-based approach to generating isolators relies on the
creation of "map" files: text files associating each tournament
of size n with a label representing that graph's isomorphism
class. In order to generate a map file for tournaments on n
vertices, we began by enumerating all $2^{n(n-1)/2}$ graphs of size
n. We converted each graph into an adjacency matrix and then
into the ".d6" format specified in the NAUTY handbook, then
fed the resulting graphs into the labelg script bundled with the
NAUTY tool for graph isomorphisms [23]. labelg produced a
file where each graph was converted to the canonical form
used by nauty. We gave each canonical form a unique label
and output the arc (directed edge) indices of each original
graph alongside its canonical form.
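
The conversion step can be sketched as follows. This is not the authors' script; it is a minimal illustration that assumes the digraph6 layout described in nauty's format notes: the character '&', then one byte encoding n (valid for n ≤ 62), then the adjacency matrix read row by row, padded with zeros and packed into 6-bit groups, each offset by 63. The helper names are hypothetical.

```python
def tournament_matrix(n, bits):
    """bits[t] = 1 orients the t-th pair (i, j), i < j, as i -> j; 0 orients it as j -> i."""
    adj = [[0] * n for _ in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    for (i, j), bit in zip(pairs, bits):
        if bit:
            adj[i][j] = 1
        else:
            adj[j][i] = 1
    return adj


def digraph6(adj):
    """Encode a 0/1 adjacency matrix (list of rows) as a digraph6 string."""
    n = len(adj)
    if n > 62:
        raise ValueError("this sketch only handles the one-byte size field")
    bits = [bit for row in adj for bit in row]
    bits += [0] * (-len(bits) % 6)                 # pad on the right to a multiple of 6
    body = "".join(
        chr(63 + int("".join(map(str, bits[i:i + 6])), 2))
        for i in range(0, len(bits), 6)
    )
    return "&" + chr(63 + n) + body


# The 3-cycle 0 -> 1 -> 2 -> 0, given as orientations of the pairs (0,1), (0,2), (1,2).
print(digraph6(tournament_matrix(3, [1, 0, 1])))
```

Strings produced this way, one per line, are the kind of input a canonical-labeling tool would then consume.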


B. Small Optimal Isolators 


Our SAT encoding allowed us to compute optimal isolators 
up to n = 6. The SAT solver CaDiCaL [24] solves the 
two instances required to prove optimality (k = 6,7 non- 
unit clauses) within 24 hours. Figures 1 and 6 graphically 
display optimal, perfect isolators for n = 4,5 by displaying 
a graph from each isomorphism class. Figure 5 presents the 
same image for one of the 56 isomorphism classes for n = 6. 
Most of the structure of these isolators can be seen from their 
unit clauses, which are depicted via red edges in the figures. 


Fig. 5. One of the 56 isomorphism class representatives admitted by a 
particular isolator for 6-vertex tournaments. Red edges are edges fixed by 
unit clauses of the isolator. 


For n = 7, solving the SAT instance directly became clearly 
infeasible (taking several days without any signs of progress). 
However, random probing allowed us to find a perfect isolator 
for n = 7,8. Each probe ran in around 10 seconds when 
restricted to force a positive literal in each clause with the 
map file reduced by the unit clauses from the next largest 
isolator. Table I describes the best (fewest non-unit clauses)
isolator found for $1 \le n \le 8$. Several thousand probes were
required to find our best known isolator for n = 7, while 2 
probes were used to find our n = 8 isolator (each n = 8 probe 
required about 2 days to finish). We note that n = 8 isolators 
can have up to 11 unit clauses; the n = 8 isolator in Table I 
was the shortest perfect isolator we generated via probing. 


TABLE I
SHORTEST PERFECT ISOLATORS FOUND FOR $n \le 8$


Vertices | Isomorphism classes | Best units | Best non-units
1        | 1                   | 0          | 0
2        | 1                   | 1          | 0
3        | 2                   | 2          | 0
4        | 4                   | 4          | 0
5        | 12                  | 6          | 2
6        | 56                  | 8          | 6
7        | 456                 | 9          | 47
8        | 6880                | 10         | 665


C. Tournament Ramsey Graphs


The known tournament Ramsey numbers are R(2) = 2,
R(3) = 4, R(4) = 8, R(5) = 14, and R(6) = 28 [25]. Note
that in most cases, the next number is two times its predecessor.
Recently, the lower and upper bounds for R(7) have been
improved from $32 \le R(7) \le 54$ to $34 \le R(7) \le 47$ [19].
The improved lower bound is due to dozens of $TT_7$-free
tournaments on 33 vertices found using SAT.

Fig. 6. All isomorphism class representatives admitted by a particular isolator
for 5-vertex tournaments. Red edges are edges fixed by unit clauses of the
isolator. The two non-unit clauses in the isolator are $ac \lor \lnot bd \lor ce$ and
$ac \lor ae \lor \lnot ce$.

McKay [26] extended this set of 33-vertex $TT_7$-free tournaments
to 5303 using the following method: generate all
29-vertex subtournaments of known 33-vertex $TT_7$-free tournaments
and extend them in all possible ways to 33-vertex
$TT_7$-free tournaments. Repeat this procedure until no new 33-vertex
$TT_7$-free tournaments are found. Also note that if a
tournament has no $TT_k$, then its complement (reversing all
arcs) also has none. This can be used to find additional $TT_k$-free
graphs as well.

Looking for neighbors and complement graphs is a well- 
known technique to compute more graphs with a certain 
property. McKay and Radziszowski used it to compute all 
known 42-vertex graphs that have no clique of size 5 nor 
a co-clique of size 5 [27]. They conjecture that this method 
generated all possible graphs of this type. 

For all known Ramsey numbers R(k), there are unique
$TT_k$-free tournaments on $R(k) - 1$ and $R(k) - 2$ vertices.
Generalizing this property, if for some n there exists a k
with a unique $TT_k$-free tournament on n vertices, then that
graph is known as $ST_n$. For example, the unique $TT_6$-free
tournaments on 26 and 27 vertices are referred to as $ST_{26}$
and $ST_{27}$ respectively.

Prior to our work, there were 5303 known $TT_7$-free tournaments
on 33 vertices, implying that $R(7) \ge 34$. So, either
k = 7 breaks the pattern of existence of $ST_n$ tournaments, or
$R(7) > 34$. We studied the 5303 33-vertex $TT_7$-free tournaments
and found that they all have $ST_{26}$ as a subtournament.
Moreover, 4952 of them have $ST_{27}$ as a subtournament.

Any $TT_7$-free tournament on 34 vertices
contains at least 1 (up to isomorphism) $TT_7$-free subtournament
on 33 vertices. Therefore, enumerating further $TT_7$-free
tournaments on 33 vertices is a step towards either finding
a $TT_7$-free 34-vertex tournament or proving that no such
tournament exists. With this motivation, we explored whether
the suite of 5303 was complete or whether there are any other
33-vertex $TT_7$-free tournaments. Our main experimental setup
involved finding new members containing $ST_{26}$ but not $ST_{27}$
by solving, with a SAT solver, a CNF formula that uses our
isolator on 7 vertices. The formula can be described as the
union of the following sets of clauses:


1) $\binom{26}{2} = 325$ unit clauses requiring that $ST_{26}$ be present
in vertices $v_1$ through $v_{26}$;

2) The perfect isolator for n = 7 on the seven remaining
vertices $v_{27}$ through $v_{33}$;

3) A clause blocking each of the 5303 known solutions for
each vertex permutation that caused the solution to have
$ST_{26}$ in vertices $v_1$ through $v_{26}$ and a graph admitted
by the n = 7 isolator in vertices $v_{27}$ through $v_{33}$; and

4) clauses enforcing the "no $TT_7$" condition from [19].


While this formula does not disallow all $ST_{27}$s (i.e. a
solution might include an extension from $ST_{26}$ that was not
present in the original solution set), it disallows all currently
known extensions, including the by far most common 1-vertex
extension from $ST_{26}$ to $ST_{27}$. Additionally, the n = 7
perfect isolator plays a crucial role for finding new solutions
in that without it, the SAT solver could find any tournament
equivalent to one of the previously known 33-vertex $TT_7$-free
tournaments except for some non-automorphic permutation of
the last 7 vertices (which would thus be isomorphic to the
previously known solution). All solutions to our formula are
non-isomorphic to the original 5303 tournaments.

On the Pittsburgh Supercomputing Center [28], we ran 640
shuffled (clause permuted) versions of the above formula on
640 cores for 6 hours using the Kissat solver [29]. We found
three different satisfying assignments. These three solutions
represented a single new 33-vertex $TT_7$-free tournament,
which is shown in Figure 7. This tournament is special as it is
self-complementary: reversing all arcs results in an isomorphic
graph. Only a small fraction of tournaments have this self-complementary
property [30]. Note that all $ST_n$ graphs have
this property by definition. After finding this new tournament,
we updated the formula to include the blocking clauses for the
new tournament and its isomorphic copies. Kissat did not produce
further solutions when using 640 shuffled (clause permuted)
versions of the updated formula on 640 cores in a day, so it is
possible that the formula is simply unsatisfiable.

Fig. 7. The new 33-vertex $TT_7$-free tournament found using our perfect isolator for n = 7. The upper-left section of the matrix is $ST_{26}$, while the
bottom-right section is a graph admitted by our n = 7 isolator.


VIII. CONCLUSIONS 


Our techniques allow the generation of isolators with
asymptotically optimal numbers of unit clauses, as well as
perfect, optimal isolators for $n \le 6$ and compact isolators for
n = 7, 8 found by random probing. We further demonstrate
how small isolators can be effectively used in the search for 
much larger graphs relevant to tournament existence problems. 
Future work using our results may lead to further improve- 
ments on bounds for the tournament Ramsey number problem. 
Abstract—Many verification and validation activities involve
reasoning about constraints over complex, hierarchical data 
types. For example, distributed protocols are often defined using 
state machines that govern the behavior of processes communi- 
cating with messages which are hierarchical data types with state- 
dependent constraints and dependencies between component 
fields. Fuzzing, analyzing and evaluating implementations of such 
protocols requires solving complex queries that pose challenges 
to current SMT solvers. Generating fields that satisfy type 
constraints is one of the challenges and this can be tackled using 
enumerative data types: types that come with an enumerator, an 
efficiently computable function from natural numbers to elements 
of the type. Enumerative data types were introduced in ACL2s 
as a key component of counterexample generation, but they do 
not handle constraints such as dependencies between types. We 
extend enumerative data types with constraints and show how 
this extension enables applications such as hardware-in-the-loop 
fuzzing of complex distributed protocols. 

Index Terms—verification, data types, distributed systems, 
fuzzing, counterexample generation, ACL2s 


I. INTRODUCTION 


The motivation for this paper stems from a project to ana- 
lyze the IEEE 802.11 Wi-Fi protocol. Since the introduction 
of the first IEEE 802.11 standard in 1997 [1], the Wi-Fi family
of protocols has become a key part of many users' ability to
access the Internet. In 2019, Cisco predicted that over half of 
global Internet traffic will be transmitted over Wi-Fi and over 
20% of global Internet traffic will be transmitted over a mobile 
network by 2022 [2]. Therefore, securing wireless networks 
and their underlying hardware is of critical importance. One 
method that researchers have used to demonstrate vulnerabili- 
ties in the Wi-Fi protocol is fuzzing, a form of testing in which 
generated data (possibly invalid) is input to a system, which 
is monitored for crashes, nonconforming responses, or other 
undesired behavior. Fuzzing has historically been successful 
in testing software systems, but bringing it into the realm of 
hardware raises several challenges. 

Consider the general problem of validating the confor- 
mance of a given hardware device to a wireless protocol 
using hardware-in-the-loop fuzzing, where we have no inter- 
nal knowledge of the device under test (DUT). In order to 
obtain good coverage of such protocols, we have to force the 
DUT into a variety of protocol states. Interesting protocols 
are nondeterministic, so we cannot easily precompute a set 
of messages to send; instead we must generate messages 
dynamically, in response to actual messages received from the
DUT. Another complication is that protocols typically contain
complex constraints on the format and contents of messages, 
making it infeasible to generate well-formed messages using 
standard fuzzing techniques. Finally, we note that such hard- 
ware devices are fast and associated protocols often involve 
short timeouts, on the order of hundreds of microseconds. 
Therefore, to effectively validate the protocol conformance of 
such devices, we must generate well-formed messages at high 
speeds. 

The prevailing approach for message generation in scenarios 
like the above has been the development of custom software 
like Wifuzzit [3] and owfuzz [4]. Developing such software 
takes a significant amount of highly specialized engineering 
effort. A more general and powerful approach is to use formal 
methods to model the protocol under which the DUT is being 
tested and to then automatically generate protocol messages 
from that model, using formal methods tools. Unfortunately, 
current formal methods are not powerful enough to generate 
messages of the required complexity and at the required rate, 
as explained in detail later. 

To address the above problem, we present enumerative data 
types with constraints, an idea that enables the fast generation 
of elements of hierarchical data types with constraints and 
inter-field dependencies. Our work is a natural extension of 
enumerative data types [5]: types that have enumerators, 
functions from natural numbers to elements of that type. We 
implemented the idea in the context of ACL2s [6], [7] and per- 
formed an evaluation by generating certain messages described 
in the 802.11 Wi-Fi protocol. Our evaluation shows that we 
are able to generate messages for a wide variety of sizes, 
something that neither SMT solvers nor pure enumerative data 
types can do. For the classes of messages that can also be 
generated by SMT or enumerative data types, our approach is 
at least two orders of magnitude faster. 
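
To make the notion of an enumerator concrete, the following is a small sketch in Python rather than ACL2s. It is not the defdata implementation; the function names are hypothetical and the particular bijections (alternating signs for integers, inverse Cantor pairing for pairs) are assumptions chosen for illustration. The point is only that each type comes with a cheap function from natural numbers to elements, and that enumerators for compound types can be built from those of their components.

```python
from math import isqrt


def nth_int(n):
    """Enumerate the integers as 0, 1, -1, 2, -2, ..."""
    return (n + 1) // 2 if n % 2 else -(n // 2)


def unpair(n):
    """Inverse Cantor pairing: a bijection from naturals to pairs of naturals."""
    w = (isqrt(8 * n + 1) - 1) // 2      # index of the diagonal containing n
    y = n - w * (w + 1) // 2             # offset along that diagonal
    return w - y, y


def nth_int_pair(n):
    """Enumerator for pairs of integers, built from the component enumerators."""
    a, b = unpair(n)
    return nth_int(a), nth_int(b)


print([nth_int(n) for n in range(5)])
print([nth_int_pair(n) for n in range(6)])
```

Feeding random or sequential natural numbers into such a function yields well-typed elements without any search, which is what makes enumerators attractive for fast test-data generation.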

Our contributions are as follows. (1) The idea of enumera- 
tive data types with constraints, which allows for the efficient 
generation of elements of dependent types with constraints 
and field interdependencies. (2) Extensions to the existing 
enumerative data type framework in ACL2s to support lists 
with length and ordering constraints, as well as improved 
support of numeric ranges. (3) The evaluation of our ideas
with a case study on fuzzing Wi-Fi access points. All tools,
models and artifacts developed for the case study, including
sets of SMTLIB2-formatted constraints that may be useful
for benchmarking SMT solvers, will be publicly available [8].
(4) The idea of FM/hardware-in-the-loop for protocol conformance
testing, where formal methods are used in the loop
of a hardware-in-the-loop approach to protocol conformance
testing.

This article is licensed under a Creative Commons Attribution 4.0 International License.


(definec foo (x :int) :bool
  (!= x (expt 2 63)))
(property (x :int) (foo x))
$>...
We falsified the conjecture. Here are counterexamples:
 -- ((X 9223372036854775808))


Fig. 1. A definition of a function and a property that ACL2s can find a counterexample
to, but QuickCheck cannot in an equivalent Haskell formulation
without the use of a custom generator.

The paper is organized as follows. Section II discusses 
related work in the areas of property testing, constraint-solver 
aided test data generation, and Wi-Fi fuzzing. Section III
describes our extensions to enumerative data types and Sec- 
tion IV describes the idea of enumerative data types with 
constraints. A full, formal description is beyond the scope 
of the paper, due to the complexity of the data definition 
framework, but we have endeavored to present the ideas 
in a way that experts will be able to adapt them to other 
languages, type systems and tools. Section V discusses aspects 
of the implementation relevant for our Wi-Fi fuzzing case 
study, described in Section VI. Conclusions are presented in 
Section VIII. 


II. RELATED WORK 


ACL2s (the ACL2 Sedan) [6], [7], is an extension of the 
ACL2 [9], [10] automated theorem prover that includes a pow- 
erful data definition framework (defdata) [5], a counterexam- 
ple generation framework (cgen) for finding counterexamples 
to conjectures [11]-[13], a powerful termination analysis based
on calling-context graphs [14] and ordinals [15]-[17] and IDE 
support in the form of an Eclipse plug-in. 

QuickCheck [18] is a tool for performing property-based 
testing. It is emblematic of a family of tools that perform 
property-based testing of programs without considering the
formal semantics of those programs. Such tools are capable of 
finding many bugs, but there are many incorrect properties that 
they are highly unlikely to find counterexamples to without 
specific direction from the user. The cgen framework of 
ACL2s was inspired by QuickCheck and builds on it by 
combining random generation with theorem proving. Fig. 1 
highlights an example of a function and property that ACL2s 
can find a counterexample to, but QuickCheck cannot in an 
equivalent Haskell formulation. 

ACL2s is able to find a counterexample in the Fig. 1 exam- 
ple by making use of reasoning capabilities provided by ACL2. 
Note that cgen was able to produce this result without any 
property-specific configuration. cgen is successful because it 
is able to benefit from ACL2’s process of transforming and 
splitting up the property being tested into smaller pieces. 


cgen also makes use of random testing during counterexample 
search. This random testing is deeply entwined with ACL2s’ 
defdata data definition system for defining types [5]. cgen 
will be discussed in more detail in Section III. 
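
The contrast drawn around Fig. 1 can be illustrated with a small Python analogy. This is only a sketch: the sampling range, the seed, and the helper names are assumptions made for illustration, and the "guided" step merely stands in for the far more general simplification that ACL2 performs on the property.

```python
import random

TARGET = 2 ** 63  # the single integer that falsifies foo in Fig. 1


def foo(x: int) -> bool:
    """Python analogue of the ACL2s function foo: true unless x == 2^63."""
    return x != TARGET


def random_search(trials: int, seed: int = 0):
    """Blind random testing over an assumed range of integers."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randint(-(2 ** 64), 2 ** 64)
        if not foo(x):
            return x  # counterexample found by luck
    return None


def guided_search():
    """Stand-in for reasoning: the negation of (x != c) simplifies to
    x == c, so the only candidate worth testing is c itself."""
    candidate = TARGET
    return candidate if not foo(candidate) else None
```

Blind sampling has to hit one value out of roughly 2^65 candidates, whereas the guided search tests a single candidate derived from the structure of the property.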

Constraint Solvers and Test Data Generation: Outside of 
ACL2, many systems have been developed that allow the 
combination of constraint solvers with models or specifications
for the purpose of test data generation. The Alloy modeling 
language and its analyzer [19] constitute one such system: see 
Sullivan et al.’s framework for automated test generation in 
Alloy [20] as well as Abdul Khalek et al.’s use of Alloy to
generate database management systems tests [21]. The Alloy 
analyzer’s model-finding system differs substantially in ap- 
proach from cgen—in particular, Alloy only supports bounded 
verification, meaning that it considers only a finite subset 
of all possible models, those with sizes in a user-provided 
bound, when verifying or searching for a counterexample to 
a property. Chamarthi et al. provide a detailed discussion of 
the differences between cgen and Alloy in [12], including 
that Alloy does not in general support recursive function 
definitions. 

Other purpose-built systems include PLEDGE [22] and 
TAF [23]. Some of these systems attempt to generate test data 
that satisfies some coverage criterion of the given model; this 
is an interesting goal that is not described in this paper. 

FuzzM [24] uses the JKind SMT-based model checker [25] 
to generate test data for fuzzing systems modeled in the Lustre 
programming language [26]. Depending on the complexity 
of the model provided, FuzzM may make queries to JKind 
that take a significant amount of time to solve. For this 
reason, FuzzM provides a generalization technique known as 
trapezoidal generalization [27] that can be used to generate 
many test data from a single datum produced by a query to 
JKind. Using trapezoidal generalization with FuzzM can result in
a data generation rate increase of several orders of magnitude. 

Wi-Fi and Fuzzing: The Wi-Fi family of protocols is 
extensively used to provide local-area internet connections 
in a wide variety of settings including homes, businesses, 
and universities. Therefore, bugs and vulnerabilities in Wi- 
Fi protocols and implementations thereof can have a wide 
reach. For example, the 2017 KRACK attack [28] exposed 
a vulnerability in the 4-way handshake described by the 
802.11 standard, affecting nearly every Wi-Fi device on the 
market at that time. The Wi-Fi protocols are based on the 
IEEE 802.11 standard [1], which describes the MAC (medium 
access control) and PHY (physical) layers of a network. We 
concern ourselves here with the MAC layer. The 802.11 
standard describes the binary format of MAC frames, a generic 
overview of which is shown in Fig. 2. 

Due to their prevalence, Wi-Fi protocols have previously 
been subjected to hardware-in-the-loop fuzz testing by several 
groups. In 2007, Laurent Butti and Julien Tinnés presented 
a hardware-in-the-loop approach [29] to fuzzing Wi-Fi client
drivers; this work resulted in the discovery of multiple bugs. 
Butti’s 2007 system did not model the 802.11 MAC frame 
specification, and it instead focused on generating fuzzed 
Field              Octets
Frame Control      2
Duration/ID        2
Address 1          6
Address 2          0 or 6
Address 3          0 or 6
Sequence Control   0 or 2
Address 4          0 or 6
QoS Control        0 or 2
HT Control         0 or 4
Frame Body         variable
FCS                4

(Frame Control through HT Control make up the MAC header.)

Figure 9-2: MAC frame format


Fig. 2. The binary layout of a generic 802.11 MAC frame. Figure taken from 
the IEEE 802.11-2020 standard [1].


frame elements and using the Scapy library [30] to generate 
packets with the appropriate structure that contain the fuzzed 
elements. More recently, Vanhoef et al. [31] described an 
approach for fuzzing access points’ implementation of the 
802.11 Wi-Fi handshake in which an abstract model of the 
Wi-Fi handshake is combined with test generation rules to 
produce test cases. These test cases consist of a sequence 
of abstract messages which are concretized into appropriate 
MAC frames when executed. This approach was able to find 
several vulnerabilities and quirks in the tested systems. In 
2019, Garbelini et al. described their Greyhound system [32], 
which uses a model of the 802.11 protocol to generate frames 
that should drive the 802.11 client device into a particular 
protocol state before sending a fuzzed frame. Using a protocol 
model also allows Greyhound to analyze responses from the 
client device to determine if the client’s responses comply
with the 802.11 protocol. None of the aforementioned works
regarding Wi-Fi fuzzing describe using theorem provers or 
constraint solvers to generate test data from protocol models. 
Based on our experience, we believe there would be a benefit 
to using constraint solvers in Wi-Fi protocol fuzzing, but the 
performance of existing approaches using constraint solvers is 
insufficient for use in this context. We will touch on this topic
more in Section VI. 


III. ENUMERATIVE DATA TYPES


The idea of enumerative data types was introduced by 
Chamarthi et al. in the context of ACL2s and its defdata 
framework [5], a rich data definition framework that allows 
one to specify and reason about user-defined types. All 
defdata types have predicative characterizations in the form 
of recognizers, functions that recognize exactly the elements 
of the type, as well as enumerative characterizations in the 
form of enumerators, functions that, given a natural number, 
return an element of the data type. Enumerators in ACL2s 
are efficient, in part because they do not involve any theorem 
proving. In this section, we provide a short overview of 
defdata and present extensions to defdata that were added 
to support our application. These extensions are publicly 
available and formally verified using ACL2s. 
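
To make the recognizer/enumerator pairing concrete, here is a minimal Python sketch. It is not how defdata is implemented; the class `EnumType`, the helpers `nat_range` and `pair_of`, and the bit de-interleaving scheme used to split an index are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class EnumType:
    """A type given by a recognizer and an enumerator (hypothetical names)."""
    recognizer: Callable[[Any], bool]  # does the value belong to the type?
    enumerator: Callable[[int], Any]   # natural number -> element of the type


def nat_range(bound: int) -> EnumType:
    """Natural numbers strictly below `bound`."""
    def recognize(v: Any) -> bool:
        return isinstance(v, int) and not isinstance(v, bool) and 0 <= v < bound

    return EnumType(recognizer=recognize, enumerator=lambda n: n % bound)


def split_index(n: int):
    """De-interleave the bits of n into two indices so that both grow with n."""
    left = right = bit = 0
    while n:
        left |= (n & 1) << bit
        n >>= 1
        right |= (n & 1) << bit
        n >>= 1
        bit += 1
    return left, right


def pair_of(a: EnumType, b: EnumType) -> EnumType:
    """Product type built from two existing enumerative types."""
    def recognize(v: Any) -> bool:
        return (isinstance(v, tuple) and len(v) == 2
                and a.recognizer(v[0]) and b.recognizer(v[1]))

    def enumerate_pair(n: int):
        i, j = split_index(n)
        return (a.enumerator(i), b.enumerator(j))

    return EnumType(recognizer=recognize, enumerator=enumerate_pair)


uint8 = nat_range(2 ** 8)
uint8_pair = pair_of(uint8, uint8)
```

Note that the enumerator only computes; no search or proof is involved, which is the property the text relies on for efficiency.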

The introduction of enumerative data types was partially 
motivated by counterexample generation and satisfiability solving.
ACL2s automatically generates counterexamples to function
definitions and conjectures using a synergistic combination
of theorem proving and enumerative data types. Theorem
proving is used to decompose and simplify conjectures, at 
which point counterexample generation algorithms use type 
inference and enumerators to randomly generate elements 
based on the types of the variables appearing in the conjecture. 
In fact, counterexample generation in ACL2s uses enumerators 
and theorem proving in a recursive fashion, e.g., after assigning 
a value to a variable, theorem proving is used to propagate 
consequences of the assignment, which may lead to further 
decompositions and simplifications as well as stronger type 
inferences, which are then exploited in further rounds of 
enumeration and theorem proving [11]-[13]. Satisfiability 
solving of ACL2s queries is performed similarly. This will 
be discussed in more detail in Section IV. 

The defdata framework includes a large collection of 
built-in types. These types include basic types such as atoms, 
symbols, characters, strings, numbers and Booleans. Subtypes 
are supported and used extensively. Examples of subtypes 
include standard, non-special characters, keywords, symbols 
corresponding to variable names, and numeric types such as 
rationals, complex rationals, non-zero rationals, positive ra- 
tionals, negative rationals, non-positive rationals, non-negative 
rationals, ratios (rationals that are not integers), positive ratios, 
negative ratios, integers, non-zero integers, natural numbers, 
positive integers, negative integers, non-positive integers, odd 
integers, even integers and zero. List and association list (alist) 
types, as well as non-empty versions, are also supported and 
are included for built-in types. There is also a universal type 
that includes all other types. 

The defdata framework allows one to easily define new 
types by providing support for singleton types, enumeration 
types and range types (numeric ranges), as well as types 
built out of existing types, such as product types, union 
types, alias types, record types, list types, alist types, recursive 
types, mutually recursive types and map types (finite partial 
functions). The framework also allows one to define custom 
types, e.g., to define the primes as a type, a user only needs 
to define a recognizer and an enumerator and then register the 
type. Custom types can then be used as if they were built-in 
to construct new types. 

Polymorphic functions are also supported by defdata, e.g., 
the form 

(sig nth (nat (listof :a)) => :a
  :satisfies (< x1 (len x2)))


states that nth is a function that, given a natural number and
a list of some type :a, returns an element of type :a, as long as the
first argument (x1) is less than the length of the list (x2).
The defdata framework automatically generates theorems 
in the form of various rules that ACL2s can use to reason about 
types using techniques such as rewriting, forward chaining, 
type reasoning, linear and non-linear arithmetic, as well as 
various decision procedures; see [9] for an in-depth discussion 
of the types of rules supported by ACL2. The framework 
includes support for specifying and reasoning about subtypes, 
e.g., it includes and generates subtype theorems for built-in 
and user-defined types. It also generates auxiliary functions,
such as constructors and destructors, as appropriate. 

Finally, the defdata framework includes numerous ad- 
vanced features, e.g., it allows users to select different random- 
ization schemes, to define custom enumerators and to switch 
between enumerators dynamically. 

We extended defdata by adding two libraries. The first 
library, deflist, provides support for defining list types with 
certain length and ordering constraints. The second library, 
defintrange, provides improved support for numeric range 
types over integers. The libraries are formally verified using 
ACL2 and are publicly available. 

The deflist library provides the defdata-list, 
defdata-ordered-list, and defdata-list-rng forms, 
which are used to define defdata lists whose length is 
between two natural numbers, ordered lists with length con- 
straints and lists with irregular length constraints, respectively. 
Consider the following example, derived from our Wi-Fi 
application: 

(defdata-list SR8 SRType 1 8) 


This defines the type SR8, which corresponds to lists whose 
length is between 1 and 8 (inclusive) of elements of type 
SRType, where SRType is a previously defined type recognizing
39 numbers between 2 and 236 that correspond to certain 
supported rates, as specified by the Wi-Fi protocol. The above 
form defines a recognizer and an enumerator for such lists. 
A type corresponding to lists of SRType with no length 
constraints is generated, if it does not already exist. Various 
tables keeping track of data types are updated. Rules for 
reasoning about lists of this type are also generated, e.g., 
forward-chaining, type-prescription, compound-recognizer and 
rewrite rules that characterize the type and relate it to other 
types are automatically generated. Rules for reasoning about 
polymorphic functions and for controlling how the theorem 
prover uses these rules are also generated. This form generates 
a collection of forms totaling 7,944 lines and consisting of 
434K bytes, all of which is formally verified by the ACL2 
theorem prover. 

The defdata-ordered-list form provides a similar 
capability but also imposes the constraint that the list is 
ordered. Consider the following example, derived from our 
Wi-Fi application: 

(defdata-ordered-list BO255 uint8 0 255) 


This defines the type B0255, which corresponds to lists of 
bytes (uint8) whose length is between 0 and 255 and whose 
elements are in increasing order. This form generates all of the 
forms that defdata-list generates, as well as rules for rea- 
soning about the sorted lists. Finally, the defdata-list-rng 
form is similar to the defdata-list form, but allows one 
to specify irregular length constraints. Consider the following 
example, derived from our Wi-Fi application: 


(defdata-list-rng BTS uint8 (gen-skip 22 254 2)) 


This defines the type BTS, which corresponds to lists of bytes 
(uint8) whose length is contained in the list of numbers 


generated by the form (gen-skip 22 254 2), which in- 
cludes the numbers 22, 24,...,254. This form generates all 
of the forms that defdata-list generates, specialized to the 
irregular lengths. 

The enumerators generated by the deflist library work by 
selecting a length in the appropriate range and then generating 
that many elements of the element type. This can be done very 
efficiently. If there are ordering constraints, then the generated 
list is sorted, using a verified sorting library we developed that 
includes an efficient sorting algorithm and supports sorting 
and potentially removing duplicates in the output. If duplicates 
are not allowed by the type, then they are removed, but this 
can result in lists whose length is shorter than desired. We 
experimented with a version of the library that generated lists 
of the appropriate length and where each such list had the same 
probability of being selected (i.e., a uniform distribution), but 
that turned out to be computationally expensive for long lists. 
Therefore, once we sort the list and remove duplicates, we add 
a pass where we add elements not already in the list until we 
reach the target length. This turns out to be almost as fast as 
the non-ordered case. 
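
The generation strategy described above (pick a length, generate elements, sort and deduplicate, then top up) can be sketched in Python as follows. This is an illustrative approximation rather than the verified ACL2s library: the function name, its parameters, and the random top-up loop are assumptions.

```python
import random


def enum_ordered_list(lo_len, hi_len, elem_bound, seed):
    """Return a strictly increasing list of naturals below `elem_bound`
    whose length lies in [lo_len, hi_len]."""
    rng = random.Random(seed)
    target = rng.randint(lo_len, hi_len)
    if target > elem_bound:
        raise ValueError("not enough distinct elements for the requested length")
    # Step 1: generate `target` elements independently.
    elems = [rng.randrange(elem_bound) for _ in range(target)]
    # Step 2: remove duplicates (the collection may now be too short).
    present = set(elems)
    # Step 3: top up with elements not already present until the target
    # length is reached.
    while len(present) < target:
        present.add(rng.randrange(elem_bound))
    # Sorting a duplicate-free collection yields a strictly increasing list.
    return sorted(present)
```

The top-up pass trades a uniform distribution over valid lists for speed, which mirrors the trade-off described in the text.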

The second library, defintrange, provides the defintrange
and defnatrange forms, which improve support for numeric
range types over integers and natural numbers. Consider
the following example, derived from our Wi-Fi application: 


(defnatrange uint48 (expt 2 48)) 


This defines the type uint48, which corresponds to the natural
numbers less than $2^{48}$. As was the case with deflist, we
generate enumerators and rules for reasoning about the type, 
subtypes and polymorphic functions. 


IV. ENUMERATIVE DATA TYPES WITH CONSTRAINTS 


Complex data types often include type dependencies be- 
tween fields. For example, consider a stack type which con- 
tains a field corresponding to the length of the stack with 
the type invariant that the value of this field is equal to the 
length of the stack. Sometimes there are dependencies between 
types, e.g., a function may require that it is provided with two 
arguments, both of which are ordered lists of equal length. In 
this section, we show how to extend enumerative data types 
to support such constraints. The idea is relatively simple, but 
very powerful. As we show in this paper, this extension enables 
applications such as hardware-in-the-loop and theorem-prover- 
in-the-loop fuzzing of distributed protocols. 

As a simple motivational example, consider a record consisting
of $n$ fields, $f_1, \ldots, f_n$, each of which is a list whose
length is between 1 and 10 (inclusive). Before our work, an
enumerator for $f_i$ would generate a list of length $l$, with
$1 \le l \le 10$, with probability $\frac{1}{10}$. However, suppose that we
had a constraint that the size of the record, defined as the sum
of the lengths of the fields, is $10n$. The probability of that
happening, using the defdata-generated enumerator, is $\frac{1}{10^n}$,
which for large $n$ is essentially 0. Or, suppose that we have a
dependent type where the lengths of the fields are required to
be equal. The probability of that happening is $\frac{1}{10^{n-1}}$, which
is also essentially 0 for large $n$.
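
The two probabilities in this example can be checked by exhaustive enumeration for small n. The sketch below assumes, as the example does, that each field length is drawn independently and uniformly from 1 to 10; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

LENGTHS = range(1, 11)  # possible field lengths: 1..10


def probability(n, predicate):
    """Exact probability that n independent, uniformly chosen lengths
    satisfy `predicate`, computed by exhaustive enumeration."""
    hits = sum(1 for lens in product(LENGTHS, repeat=n) if predicate(lens))
    return Fraction(hits, len(LENGTHS) ** n)


def p_total_size_is_max(n):
    """Probability that the field lengths sum to 10n."""
    return probability(n, lambda lens: sum(lens) == 10 * n)


def p_all_equal(n):
    """Probability that all n field lengths coincide."""
    return probability(n, lambda lens: len(set(lens)) == 1)
```

Only one of the 10^n length tuples sums to 10n, and only 10 of them have all lengths equal, so both probabilities shrink exponentially in n.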
The idea of enumerative data types with constraints is
that we allow users to define types with parameters. These 
parameters are associated with functions over the data types 
and we require that, given values for these parameters, efficient 
enumerators for the types can be defined. For example, con- 
sider a list type with a parameter corresponding to the length 
of the list; the associated function is just the length function. 
Given a particular length, it is easy to generate a list of that 
length by generating the required number of elements using 
the enumerator for the element type. The next idea is to allow 
users to define constraints over the parameters and associated 
functions of types. If these constraints are over a decidable 
fragment of logic, then enumeration winds up becoming a 
two-stage process by which we find satisfying assignments 
to the constraints, providing values for the parameters, which 
are then used by the corresponding enumerators. Consider 
the motivating example where we had fields $f_1, \ldots, f_n$ with
parameters $p_1, \ldots, p_n$, corresponding to the field lengths. The
constraint that the size of the record is $10n$ gets turned into
a constraint that the sum of the lengths of the fields is $10n$,
and this can be given to an SMT/IMT solver [33]-[35]. This
is a simple constraint, which in terms of the parameters is
$p_1 + \cdots + p_n = 10n$, and which only has one solution (since
each $p_i$ is at most 10), namely
$p_i = 10$ for every $i$. With the appropriate values for the parameters, we
can now call the enumerators for the fields of the record, which 
will generate lists of the appropriate length, with probability 1. 
In general, enumerators require solving a set of constraints and 
then calling enumerators of component types, which may also 
require solving a set of constraints, and so on, recursively. 
As an optimization, recursive constraints associated with an 
enumerator can be packaged into single queries during the 
enumerator generation process, thereby minimizing the num- 
ber of constraint-solving queries required by enumerators. 
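
The two-stage process can be sketched in Python as follows. This is a toy under stated assumptions: a brute-force search stands in for the SMT/IMT query, the fields are lists of bytes, and all function names are hypothetical rather than part of ACL2s.

```python
import random
from itertools import product


def solve_lengths(n, lo, hi, constraint):
    """Stage 1 (stand-in for a solver query): find values for the length
    parameters p_1..p_n in [lo, hi] that satisfy `constraint`."""
    for lens in product(range(lo, hi + 1), repeat=n):
        if constraint(lens):
            return lens
    return None  # no assignment exists: the constraint is unsatisfiable


def enum_list(length, seed):
    """Parameterized enumerator: a list of bytes of exactly `length`."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(length)]


def enum_record(n, constraint, seed=0):
    """Stage 2: call the field enumerators with the solved parameters."""
    lens = solve_lengths(n, 1, 10, constraint)
    if lens is None:
        return None
    return [enum_list(p, seed + i) for i, p in enumerate(lens)]
```

For instance, `enum_record(3, lambda ls: sum(ls) == 30)` yields three fields of length 10 each, so the size constraint holds by construction instead of by chance.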

In our Wi-Fi application, and more generally in other verifi- 
cation efforts, we want to determine the satisfiability of a set of 
ACL2s constraints which include not only various data types, 
but also other constraints arising from a variety of sources, 
including coverage criteria, responses to messages from the 
DUT, well-formedness constraints, protocol constraints and 
modeling constraints. Queries to the underlying solver consist 
of the maximal subsets of these ACL2s constraints that can 
be expressed in the theory supported by the solver. If such a 
query is unsatisfiable, so is the corresponding ACL2s query; 
if the query is satisfiable, then we have values for the data 
type parameters which can be used to efficiently (without 
constraint solving) generate satisfying assignments to the 
datatype variables. If there are any remaining constraints, they 
are handled by the ACL2s counterexample generation process. 

As we show later, we can formalize complex protocol 
interactions using types. These types include fields that are 
ordered lists over certain numbers, that have variable length 
and optional fields and that include other complex dependen- 
cies. Finding satisfying assignments to such types is difficult 
for current SMT solvers, but easy when using enumerative data 
types with constraints because we use constraint solving only 
for the true dependencies; we then use the enumerative


(solver-init)
(z3-assert (x :bool y :int z (:seq (:bv 3)))
           (and x (>= y 5) (= (seq.len z) y)))
(check-sat)
$> ;; This is SAT, so we get a model:
((X T) (Y 5) (Z (0 0 0 0 0)))

Fig. 3. An example showing the use of our Common Lisp-Z3 interface. 


characterization of defdata to generate assignments using 
computation alone (i.e., no constraint solving). 


V. IMPLEMENTATION 


We implemented enumerative data types with constraints in 
ACL2s, which provides support for defining tools on top of 
ACL2s via “ACL2s systems programming” [36]. We used Z3 
as the constraint solver, which required that we integrate Z3 
with ACL2s. To this end, we developed a library allowing one 
to easily call Z3 from Common Lisp. In this section, we will 
describe both the Common Lisp-Z3 interface library, and how 
we interacted with ACL2s. 

Common Lisp-Z3 Interfacing: We decided to implement a 
close integration of ACL2 and Z3, using the CFFI Common 
Lisp library [37] to directly load Z3 into an ACL2s process 
and interact with it using Z3’s C API. Such a close integra- 
tion brings several benefits, including a low overhead when 
interacting with Z3 and the ability to support Z3 features like 
incremental solving. We developed our own Common Lisp 
library that provides both a low-level interface with Z3’s C 
API and a high-level interface that allows the user to add 
assertions to Z3 using a syntax similar to that of ACL2s’ 
property macro. See Fig. 3 for an example showing the use 
of our library. Our interface supports a broad swathe of Z3’s 
features, including many of its built-in functions and types, 
several kinds of user-generated types and incremental solving. 

ACL2s Interfacing: Since our system is implemented using 
the ACL2s systems programming paradigm, we are able to 
write Common Lisp code that calls into ACL2s. Our system 
starts inside the ACL2 read-eval-print loop (REPL), where 
we load in the ACL2s model that we will pull enumerators 
from. We then are able to exit from the ACL2 REPL into 
the underlying Common Lisp REPL that our copy of ACL2 
is built on top of, where we can load any Common Lisp 
code that we might want, including our Common Lisp-Z3 
library. To evaluate a function inside of ACL2—for example, 
an enumerator for a defdata type—we first generate an S- 
expression corresponding to the function call, and then pass 
that S-expression to the appropriate function provided by 
Walter et al.’s acl2s-interface library [38]. 

For our application, after running Z3 and getting back a 
length for each element of the structure being generated, we 
need to then generate elements with those lengths. Since each 
variable-length element has a list type corresponding to the 
set of bodies that it may have, we can make use of a special 
kind of enumerator that ACL2s produces for list types. This 
enumerator takes two arguments: the number of elements to 
generate, and the random seed to use. To generate an element 
of a list type with a particular length, we simply call the
enumerator with the desired length and an appropriate random 
seed. We can then construct our structure from its constituent 
parts by performing an appropriate ACL2s call. 


VI. WI-FI MODEL CASE STUDY AND EVALUATION 


We present an application of enumerative data types with 
constraints to hardware-in-the-loop 802.11 wireless router 
fuzzing. We focus on the problem of generating a particular 
kind of 802.11 MAC frame, the probe request frame, as this is 
already sufficiently complex to present the challenges in mod- 
eling and frame generation. We first describe some challenges 
that come with hardware-in-the-loop fuzzing before discussing 
the probe request frame in more depth. We then discuss two 
models of the probe request frame that we developed, the 
first using Lustre and the second using ACL2s. We highlight 
the key challenges that arose when developing the Lustre 
model, and how we were able to use ACL2s to surmount 
these challenges and produce a more concise model. We then 
describe a system that implements enumerative data types with 
constraints alongside the ACL2s model, and conclude with 
experiments showing that our enumerative data type approach 
is able to generate probe request frames at a significantly 
greater rate and for a wider range of frame sizes than either 
a pure constraint solving approach or a pure enumerative data 
type approach. 


Hardware-in-the-loop Fuzzing for Protocol Conformance 


Fuzzing a hardware system like a wireless router brings with 
it certain requirements on the fuzzer and fuzzing infrastructure. 
The device under test (DUT) needs to be monitored, an 
interface must be formed between the DUT and the fuzzer, and 
in the case of protocol fuzzing, the fuzzer may be required to 
adhere to timing constraints imposed by the DUT. The latter 
constraint means that the performance of a fuzzer may not just 
affect how long it may take to find a particular vulnerability, 
but it may entirely preclude a fuzzer from use if it cannot 
generate a fuzzed response to a message sent by the DUT 
quickly enough. 

The systems described below are intended to be one part of 
a larger hardware-in-the-loop fuzzing system, an architecture 
of which can be seen in Fig. 4. Each approach that we describe 
contains two parts: a model describing the probe request frame, 
and a fuzzer that uses the model to generate descriptions of 
concrete 802.11 probe request frames given some additional 
constraints on the size of the frame. 


The Probe Request Frame 


When a wireless device aims to connect to an 802.11 Wi-Fi
access point, it must first gather information on the capabilities 
of wireless access points that are within range. To do this, the 
wireless device first sends out a probe request message with 
some basic information on its capabilities. Any Wi-Fi access 
point that is within range and supports at least one of the 
capabilities advertised by the wireless device will then respond 
with a probe response message containing information about 




Fig. 4. An overview of a hardware-in-the-loop fuzzing architecture 


itself. The wireless device will then select an access point to 
connect to and continue exchanging messages. The details of 
this process are described in the IEEE 802.11 specification [1]. 
Here we concern ourselves with the MAC frame corresponding 
to the probe request message. 

The 802.11 specification states that every MAC frame 
consists of three parts: a header, a body, and a frame check 
sequence (FCS), which is a checksum for the previous two 
parts. We will not discuss the header and FCS parts, as the 
hardware-in-the-loop testing system can take care of setting 
the header and FCS as appropriate. 

A probe request frame body consists of a variable-length 
sequence of elements, some of which are optional. Any 
elements that appear must appear in a specified order relative 
to each other. Elements typically contain a 1-byte “Element 
ID” field that has a constant value for all elements of a 
particular type, a 1-byte “Length” field that indicates the 
number of bytes remaining in the element after the end of the 
“Length” field, an optional 1-byte “Element ID Extension” 
field, and a variable-length set of element-specific fields. In 
this paper, we will consider the element-specific fields to all be 
concatenated into one “Body” field. The size of a probe request 
frame is the sum of the size of the MAC header (32 bytes) 
and the sizes of all elements appearing in the frame body. 
The 802.11 specification enumerates 33 element types for the 
probe request frame body, and the constraints on valid values 
for each element type vary widely. For example, the “DSSS 
Parameter Set” element’s body is 1 byte long and should 
specify the “Current Channel” that the device is using; the 
set of valid values depends on the PHY implementation being 
used as well as some other settings. The “Request”
element’s body has a more complicated constraint: it is a 
variable-length list of bytes corresponding to “Element ID”s, 
and the bytes must be listed in increasing order. As we will 
see, such constraints are difficult to express in Lustre, and lead 
to a lengthy specification. 
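
The element layout described above can be made concrete with a short Python sketch. It follows only what the text states (1-byte Element ID, 1-byte Length counting the bytes after it, optional 1-byte Element ID Extension, then the body, and a 32-byte MAC header). The function names are hypothetical, and treating "increasing order" as strictly increasing is an assumption that matches the Lustre model.

```python
MAC_HEADER_SIZE = 32  # bytes, as stated in the text


def encode_element(element_id, body, extension_id=None):
    """Serialize one element: Element ID, Length, optional Element ID
    Extension, then the body. Length counts every byte after the Length
    field itself."""
    prefix = bytes([extension_id]) if extension_id is not None else b""
    payload = prefix + bytes(body)
    if len(payload) > 255:
        raise ValueError("payload does not fit in a 1-byte Length field")
    return bytes([element_id, len(payload)]) + payload


def encode_request_element(element_id, requested_ids):
    """Encode a 'Request'-style body: a list of Element ID bytes that must
    be in (here: strictly) increasing order."""
    ids = list(requested_ids)
    if any(a >= b for a, b in zip(ids, ids[1:])):
        raise ValueError("requested Element IDs must be strictly increasing")
    return encode_element(element_id, ids)


def frame_size(elements):
    """Frame size as defined in the text: MAC header plus all elements."""
    return MAC_HEADER_SIZE + sum(len(e) for e in elements)
```

Checking the ordering constraint on a concrete list is a one-line comparison of adjacent bytes; the difficulty discussed next lies in expressing the same constraint declaratively.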


The Lustre Model 


Our first model of the 802.11 probe request frame was 
developed using the Lustre programming language. When 
modeling the probe request frame body specification, we chose 
to abstract away some details of the specification in the interest 
of focusing on aspects of the specification that are interesting 
type RequestElementType = struct {
  ElementID: byte;
  Len: byte;
  Body: byte[255] };

--Each element of the Body field is a byte.
node RequestElementTypeAssertions
  (e: RequestElementType) returns (r: bool);
let
  r =
    (0<=e.Len) and (e.Len<=255) and
    (0<=e.Body[0]) and (e.Body[0]<=255) and
    ...
    (0<=e.Body[254]) and (e.Body[254]<=255);
tel
--The first Length elements of Body are
--sorted.
node RequestElementOrderedElementIDConstraint
  (e: RequestElementType) returns (r: bool);
let
  r =
    ((e.Len<1) or (e.Body[0]<e.Body[1])) and
    ...
    ((e.Len<254) or (e.Body[253]<e.Body[254]));
tel


Fig. 5. A code snippet highlighting how an element containing a variable- 
length sorted array of bytes is modeled in Lustre. 


and representative. For example, we simply modeled the body 
of the DSSS parameter set element as a byte. In general, 
the Lustre model constrains the shapes of elements but not 
their body values, which we believe is reasonable considering 
the model is intended for use for fuzzing. That is, the Lustre 
model specifies probe request frame bodies that are of valid 
lengths and that have elements in the correct locations, but 
does not constrain the exact values that the body of each 
element may take to only those that are valid based on the 
802.11 specification. 

Lustre does not provide built-in support for bounded integer 
types, which means that specifying that a field is a byte is 
done by declaring that the field is an integer and that its 
value is between 0 and 255 inclusive. This becomes even more 
problematic when modeling variable-length arrays: to model 
an array of bytes of length between 0 and 255, the Lustre
model specifies an array of length 255, specifies a variable
representing the length of the array and adds a constraint
for every element of the array stating that its value should
be between 0 and 255 inclusive. This means that 255 array
elements are always generated, and the system consuming
values generated from the Lustre model simply omits any
array elements that occur past the generated length value.
Specifying the “Request” element is even more verbose, since
in addition to the aforementioned constraints, 254 constraints
are generated to specify that if the length of the array is greater
than $i$, the element at index $i - 1$ in the array is strictly less
than the element at index $i$. See Fig. 5 for a snippet of the
Lustre model that defines a frame element with a variable- 
length sorted array of bytes. 
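
The verbosity described above can be made concrete with a small Python sketch that emits the per-element constraints in the textual shape shown in Fig. 5 and mimics the consumer that drops array slots past the generated length. The function names are hypothetical and are not part of the Lustre model or of FuzzM; the sketch only illustrates how many constraints the encoding needs.

```python
def byte_range_constraints(array_len: int = 255) -> list[str]:
    """One range constraint per array slot, in the style of Fig. 5."""
    return [f"(0<=e.Body[{i}]) and (e.Body[{i}]<=255)" for i in range(array_len)]


def ordering_constraints(array_len: int = 255) -> list[str]:
    """One ordering constraint per adjacent pair, guarded by the length field."""
    return [
        f"((e.Len<{i}) or (e.Body[{i - 1}]<e.Body[{i}]))"
        for i in range(1, array_len)
    ]


def used_body(length: int, body: list[int]) -> list[int]:
    """Consumer side: keep only the slots before the generated length."""
    return body[:length]


ranges = byte_range_constraints()
ordering = ordering_constraints()
print(len(ranges), "range constraints,", len(ordering), "ordering constraints")
print(ordering[0])
print(ordering[-1])
print(used_body(3, [4, 9, 17] + [0] * 252))
```

The two counts grow linearly with the fixed array length, which is the source of the model size discussed in the text.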

The Lustre model was used in conjunction with FuzzM to 
generate probe request frames. FuzzM was not able to generate 
assignments for certain frame sizes, as the SMT queries did 
not produce results even given a timeout of many minutes. 


;; A natural number less than 256
(defnatrange uint8 (expt 2 8))
;; a list of uint8s with a length in [0,255)
(defdata-list byte255 uint8 0 255)
;; a byte255 that is also strictly ordered
(defdata-ordered-list byte255-increasing uint8
  0 255)
;; Sanity check: should always be able to find
;; a byte255 that is not a byte255-increasing
(must-fail (property (x :byte255)
  (byte255-increasingp x)))
;; A type for the constant 10
(defdata exact10 10)
;; We model elements using records
(defdata RequestElementType
  (record (ElementID exact10)
          (Body byte255-increasing)))


Fig. 6. A snippet of the ACL2s model showing how an element containing 
a variable-length sorted array of bytes is modeled. Also included are sanity 
checks that do not appear in the Lustre model. 


The ACL2s Model 


We developed an ACL2s model based on the Lustre model. 
The ACL2s model makes heavy use of defdata, which has 
a much more powerful notion of types than Lustre. The 
expressiveness of ACL2s allows us to more succinctly encode 
the constraints imposed by the 802.11 standard. defdata has 
built-in support for bounded integer types, making redundant 
many of the constraints that had to be stated explicitly in 
the Lustre model. We also used the extensions described in 
Section III to define list types with length bounds and ordering 
constraints. Fig. 6 shows all of the definitions necessary to 
model the “Request” element in ACL2s with our extensions. 

Another benefit of developing our model in ACL2s is that 
we can include sanity checks inline with the model. ACL2s 
will evaluate the checks when the model is loaded during 
development, helping catch mistakes in the model specification 
that may otherwise go undetected. These checks can include 
validating that ACL2s can find a counterexample to a property 
(as seen in Fig. 6) but also may include proofs or code 
execution. If proofs are included, they may be used by ACL2s 
to prove or generate counterexamples to future conjectures. 
Even with sanity checks, the ACL2s version of the model has 
roughly a quarter of the lines of code present in the Lustre 
model. 


Evaluation 


We performed experiments to compare the performance of 
three approaches to probe request frame generation: enumer- 
ative data types using the ACL2s model and cgen (ACL2s- 
ET below), enumerative data types with constraints using the 
ACL2s model and an application-specific prototype of the 
approach described in Section IV (ACL2s-ETC below), and 
a pure constraint solving approach using a Z3-only version of 
the Lustre model and Z3 (Z3 below). 

We measured the performance of each approach when 
queried for probe request frame bodies of various sizes, 
including sizes for which no probe request frame body exists.
Z3 and ACL2s were both configured to time out after 20
seconds. ACL2s was set to use the :uniform-random cgen
sampling method and was configured to terminate once it 
found a single counterexample rather than the default three; 
this brings its behavior more into line with Z3’s. All other Z3 
and ACL2s settings were left in their default state. We provide 
code for reproducing these experiments along with this paper. 

Fig. 7 shows the number of query responses per minute for 
each approach across a range of probe request frame body 
sizes from 0 to 5000 bytes, sampled every 10 bytes. Five 
trials were performed for each frame size for all approaches. 
The number of query responses per minute for a particular 
approach and probe request frame body size was calculated 
by dividing the total number of queries made for that size that 
resulted in definitive responses (i.e., not timeouts) by the total
amount of time in minutes spent on all queries for that size. 
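
As a minimal sketch of this metric, the following Python function computes the rate for one frame size from a list of trials. The outcome labels and the sample timings are made up for illustration and are not taken from the experiments.

```python
def definitive_responses_per_minute(trials):
    """trials: list of (outcome, seconds) pairs recorded for one frame size.

    An outcome counts as definitive if it is "sat" or "unsat"; anything else
    (for example "timeout" or "unknown") only contributes to the elapsed time.
    """
    definitive = sum(1 for outcome, _ in trials if outcome in ("sat", "unsat"))
    total_minutes = sum(seconds for _, seconds in trials) / 60.0
    return definitive / total_minutes if total_minutes > 0 else 0.0


# Hypothetical data: five trials, two of which hit the 20 second timeout.
sample = [("sat", 1.0), ("sat", 2.0), ("timeout", 20.0), ("sat", 1.0), ("timeout", 20.0)]
print(definitive_responses_per_minute(sample))
```

Note that timeouts lower the rate twice: they add no definitive response and they add the full timeout to the denominator.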

There are three regimes of frame size to discuss: 

Small invalid probe request frame sizes (0-170 bytes): We 
expected all of the approaches to perform well in this regime. 
ACL2s-ETC consistently was able to determine UNSAT across 
this range of sizes, and the Z3-only approach performed well 
up to sizes of 150 bytes. ACL2s-ET was only able to determine 
sizes up to 30 bytes were UNSAT; all of the other queries in 
this regime resulted in timeouts. Note that Z3’s performance 
begins to fall exponentially for frame sizes of 160 or greater. 
Valid probe request frame sizes (180-2740 bytes): ACL2s- 
ETC is consistently able to generate frames at a rate greater 
than 1000 per minute, while ACL2s-ET is only able to 
generate frames for a subset of the frame sizes at a rate 
of at most 22 per minute and the Z3 approach is unable 
to generate any frames with a size greater than 300 bytes. 
The distribution of ACL2s-ET’s response rate (approximately 
normally distributed around the average valid frame size of 
1456 bytes) suggests that ACL2s is falling back on random 
generation of frame bodies; that is, generating a frame body by 
independently and randomly generating each element without 
consideration of the frame size constraint. The exponential 
drop in the Z3 approach’s performance suggests that Z3’s 
search space grows exponentially with frame size. 

Large invalid probe request frame sizes (2750-5000 bytes): 
ACL2s-ETC is consistently able to quickly determine these 
sizes are UNSAT, while ACL2s-ET can do so slowly but 
consistently. For every sampled size in this regime, the Z3
approach determined UNSAT in at least one trial, but it did so
in all five trials for only 100 of the 226 probe request frame
sizes sampled between 2750 and 5000 bytes. This highlights
inconsistency in Z3’s ability to determine UNSAT for large
frame sizes.

These results highlight the weaknesses of the Z3-only and 
ACL2s-ET approaches. The Z3 approach was able to quickly 
determine that small frame sizes are impossible and was 
consistently able to generate frames with sizes up to 210 
bytes. However, the proportion of trials that resulted in SAT 
responses began to quickly drop after that point, and no SAT 
responses were received for trials with valid packet sizes of 
290 bytes or greater. The Z3 approach’s performance was 


[Figure 7 plot: definitive responses per minute (vertical axis) against frame size in bytes (horizontal axis, 0 to 5000) for the ACL2s-ET, Z3 and ACL2s-ETC approaches, with the UNSAT, SAT and UNSAT regions marked.]

Fig. 7. The number of frames generated per minute using each of the three
approaches when queried for frames with a given length. Only instances where
the model returned a definitive response (e.g. not “unknown” or “timeout”)
are shown. The two vertical lines represent the minimum frame size and the
maximum frame size; any responses outside of that range were all UNSAT,
and any within that range were SAT.


highly variable for determining that larger frame sizes are 
impossible, and though it was inconsistent, it was always able 
to show UNSAT in at least one of the five trials performed. It 
is possible that an alternative encoding of the Z3 model (for 
example, one that does not make use of Z3’s sequence types) 
would perform better, but our experience in using the Lustre 
model with FuzzM does not suggest a significant improvement 
in performance. 

The ACL2s-ET approach was consistently able to show that 
large frame sizes are impossible, and was able to generate 
frames for a wider range of frame sizes than the Z3 approach, 
though it struggled to generate large or small frames and to 
show that very small frame sizes are impossible. ACL2s is not 
using information from the frame size constraints to guide its 
counterexample generation in a meaningful way; cgen could 
be modified to improve its effectiveness here. 


VII. FUTURE WORK 


This paper introduces the idea of enumerative data types 
with constraints, or, equivalently, the idea of enumerative 
dependent types. We believe that this idea will be useful 
in many applications, e.g., those requiring the analysis and 
verification of systems and models defined using dependent 
data types. Such applications include property-based testing, 
model-based development and distributed systems. 

Below we provide a partial list of ideas for future work. 

Formalizations and extensions: We plan on developing 
and formalizing the theory of enumerative data types with 
constraints for ACL2s and encourage others to develop similar 
formalizations for other dependent type systems and inter- 
active theorem provers. We suspect that there are numerous 
interesting directions in which the basic approach can be
extended to handle dependent logics of varying expressive power.
A specific extension of interest involves supporting relations
of arbitrary arity, not just predicates. Conceptually this is 
straightforward: the relation can be turned into a predicate by 
combining all of the relation’s arguments into a single value (a 
tuple or a record). Then, our approach allows us to represent 
and handle dependencies between the relation’s arguments. A 
user can manually perform the conversion from relation to 
predicate, but ideally this could be done automatically. 

ACL2 integration: We plan to provide first-class support for 
enumerative data types with constraints as part of the ACL2s 
defdata framework, so that ACL2s users can benefit from 
our work without needing to write custom code. Our proof- 
of-concept implementation used for this paper’s evaluation 
uses ACL2s systems programming [36] techniques and is not 
integrated with ACL2s. 

Optimizations: The ACL2s-ETC implementation evaluated 
in this work was not optimized, and we are confident that 
there are opportunities for both general and application- 
specific performance improvements in our method. One such 
optimization that we have experimented with in the context 
of stateful protocols is to perform offline (pre-enumeration) 
analyses of the protocol’s state machine to identify how to 
efficiently explore interesting regions of the protocol’s state 
space. This pre-analysis can significantly reduce the amount 
of work needed at enumeration time to generate appropriate 
responses to messages from the SUT. There are also interesting 
questions regarding coverage metrics and “fair” explorations 
that model analyses can help answer. 

Model extraction: One limitation of our current work is that 
it requires models that are described using dependent types. An
interesting question is whether it is possible to provide automatic
techniques that are able to take existing models and annotate
them with the type information required to use our work. This
line of research can include the use of AI techniques such as 
Natural Language Processing (NLP) to automatically translate 
legacy prose descriptions of protocols into formal models that 
can be analyzed using our approach. 


VIII. CONCLUSION 


In this paper, we introduced the idea of enumerative data 
types with constraints. This allows us to use formal-methods- 
in-the-loop in the context of hardware-in-the-loop fuzzing for 
conformance testing of distributed protocols. We presented 
a case study where we modeled a portion of the IEEE 
802.11 Wi-Fi specification and showed that we are able to 
generate messages for a wide variety of sizes, something 
that previous methods cannot do, thereby enabling the use of 
formal methods in new applications. Interesting directions for 
future work include adding such capabilities to other formal 
methods tools and using enumerative data types to analyze 
other distributed protocols.

Abstract—We present an alternative proof of the NEXP-
hardness of the satisfiability of Dependency Quantified Boolean 
Formulas (DQBF). Besides being simple, our proof also gives us 
a general method to reduce NEXP-complete problems to DQBF. 
We demonstrate its utility by presenting explicit reductions from 
a wide variety of NEXP-complete problems to DQBF such as 
(succinctly represented) 3-colorability, Hamiltonian cycle, set 
packing and subset-sum as well as NEXP-complete logics such 
as the Bernays-Schönfinkel-Ramsey class, the two-variable logic 
and the monadic class. Our results show the vast applications 
of DQBF solvers which recently have gathered a lot of attention 
among researchers. 

Index Terms—Dependency quantified boolean formulas 
(DQBF), NEXP-complete problems, polynomial time (Karp) re- 
ductions, succinctly represented problems 


I. INTRODUCTION 


The last few decades have seen a tremendous development 
of boolean SAT solvers and their applications in many areas of 
computing [1]. Motivated by applications in verification and 
synthesis of hardware/software designs [2]-[8], researchers 
have recently looked at the generalization of boolean formulas 
known as dependency quantified boolean formulas (DQBF). 

While solving boolean SAT is “only” NP-complete, for 
DQBF the complexity jumps to NEXP-complete [9]. This 
makes solving DQBF quite a challenging research topic. 
Nevertheless there has been exciting progress. See, e.g., [10]- 
[18] and the references within, as well as solvers such as 
iDQ [19], dCAQE [20], HQS [21], [22] and DQBDD [23]. 
A natural question to ask is if we can use DQBF solvers to 
solve any NEXP-complete problems — similar to how SAT 
solvers are used to solve any NP-complete problems. 

In this short paper we show how to reduce a wide variety of 
NEXP-complete problems to DQBF, especially the succinctly 
represented problems that recently have found applications in 
hardware/software engineering [24]-[26]. We present another 
proof for the NEXP-hardness of DQBF. We actually give two 
proofs. The first is by a very simple reduction from succinct 
3-colorability [27]. The second is by utilizing the notion that 
we call succinct projection. It is the second one that we view
as more interesting since it gives us a general method to reduce
any NEXP-complete problem to DQBF. 

The main idea is quite standard: We encode the accepting 
runs of a non-deterministic Turing machine (with exponential 
run time) with boolean functions of polynomial arities. How- 
ever, we observe that the input-output relation of these func- 
tions can actually be “described” by small circuits/formulas. 
Succinct projections are simply deterministic algorithms that
construct these circuits efficiently. This simple observation is
a deviation from the standard definition of NEXP, that a 
language in NEXP is a language with an exponentially long 
certificate. 

Using succinct projections, we present reductions from vari- 
ous NEXP-complete problems such as (succinct) Hamiltonian 
cycle, set packing and subset sum. We believe our technique 
can be easily modified for many other natural problems. Note 
that the reduction in [9] gives little insight on how it can 
be used to obtain explicit reductions from concrete NEXP- 
complete problems. 

We also present the reductions from well known NEXP-
complete logics such as the Bernays-Schönfinkel-Ramsey class,
two-variable logic (FO$^2$) and the Löwenheim class [28]-[32].
In fact we show that they are essentially equivalent to DQBF.
Note that these are logics that have found applications in 
AI [33], databases [34] and automated reasoning [35], but 
lack implementable algorithms. Prior to our work, the only 
algorithm known for these logics is to “guess” a model (of 
exponential size) and then verify that it is indeed a model of 
the input formula. 

We hope that the technique introduced in this short paper 
can lead to richer applications of DQBF solvers as well as a 
wide variety of benchmarks which in turn can lead to further 
development. It is also open whether the class NEXP has a 
bona-fide problem [27]. Our paper demonstrates that DQBF 
can be a good candidate — akin to how boolean SAT is the 
central problem in the class NP. 

This paper is organized as follows. In Sect. II we review 
some definitions and terminology. In Sect. III we reprove the 
NEXP-completeness of solving DQBF. In Sect. IV and V 
we present concrete reductions from some NEXP-complete 
problems and logics to DQBF instances. The full version of 
this paper can be found in [36]. 


II. PRELIMINARIES 


Let $\Sigma = \{0,1\}$. We usually use the symbols $\bar{a}, \bar{b}, \bar{c}$ (possibly
indexed) to denote strings in $\Sigma^*$, with $|\bar{a}|$ denoting the length
of $\bar{a}$. The length of a vector of variables $\bar{z}$ is denoted by $|\bar{z}|$.
We write $C(\bar{u})$ to denote a (boolean) circuit $C$ with input gates $\bar{u}$.
When the input gates are not relevant or clear from the context, we simply
write $C$. For $\bar{a} \in \Sigma^{|\bar{u}|}$, $C(\bar{a})$ denotes the value of $C$
when we assign the input gates $\bar{u}$ with $\bar{a}$. All logarithms have base 2.


This article is licensed under a Creative Commons Attribution 4.0
International License.


A dependency quantified boolean formula (DQBF) in
prenex normal form is a formula of the form:

$$\Psi := \forall x_1 \cdots \forall x_n \ \exists y_1(\bar{z}_1) \cdots \exists y_m(\bar{z}_m) \ \psi \qquad (1)$$

where each $\bar{z}_i$ is a vector of variables from $\{x_1,\ldots,x_n\}$ and
$\psi$, called the matrix, is a quantifier-free boolean formula using
variables $x_1,\ldots,x_n,y_1,\ldots,y_m$. The variables $x_1,\ldots,x_n$
are called the universal variables, $y_1,\ldots,y_m$ the existential
variables and each $\bar{z}_i$ the dependency set of $y_i$.

A DQBF $\Psi$ in the form (1) is satisfiable, if for every
$1 \le i \le m$, there is a function $s_i : \Sigma^{|\bar{z}_i|} \to \Sigma$ such that
by replacing each $y_i$ with $s_i(\bar{z}_i)$, the formula $\psi$ becomes a
tautology. The function $s_i$ is called the Skolem function for $y_i$.
In this case, we also say that $\Psi$ is satisfiable by the Skolem
functions $s_1,\ldots,s_m$. The problem SAT(DQBF) is defined as:
On input DQBF $\Psi$ in the form (1), decide if it is satisfiable.
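
The definition of satisfiability can be read directly as a (doubly exponential) search over Skolem functions. The following Python sketch is only meant to make the definition concrete on tiny instances; it is not how DQBF solvers work. The function names and the 0-based variable indexing are choices made here for illustration.

```python
from itertools import product


def all_functions(arity):
    """Yield every boolean function of the given arity as a dict on bit tuples."""
    inputs = list(product((0, 1), repeat=arity))
    for outputs in product((0, 1), repeat=len(inputs)):
        yield dict(zip(inputs, outputs))


def dqbf_satisfiable(n, deps, matrix):
    """Brute-force SAT(DQBF) for a formula in the form (1).

    n      -- number of universal variables x_0, ..., x_{n-1}
    deps   -- deps[i] lists the indices of the universal variables in the
              dependency set of the existential variable y_i
    matrix -- function (xs, ys) -> truth value, the quantifier-free part
    """
    assignments = list(product((0, 1), repeat=n))
    candidates = [list(all_functions(len(d))) for d in deps]
    for skolem in product(*candidates):
        # The matrix must hold for every assignment to the universal variables.
        if all(
            matrix(xs, tuple(s[tuple(xs[j] for j in d)] for s, d in zip(skolem, deps)))
            for xs in assignments
        ):
            return True
    return False


# y_0 may read x_0 and y_1 may read x_1: each can copy its own input.
print(dqbf_satisfiable(2, [(0,), (1,)], lambda xs, ys: ys[0] == xs[0] and ys[1] == xs[1]))
# y_0 may only read x_0, so it cannot track x_1.
print(dqbf_satisfiable(2, [(0,)], lambda xs, ys: ys[0] == xs[1]))
```

The second call illustrates why dependency sets matter: the same matrix would be satisfiable if $y_0$ were allowed to depend on $x_1$.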

Since many NEXP-complete problems use circuits as the
succinct representations of the inputs, we allow the matrix $\psi$
to be in circuit form, i.e., $\psi$ is given as a (boolean) circuit
with input gates $x_1,\ldots,x_n,y_1,\ldots,y_m$. This does not affect
the generality of our results, since every DQBF in circuit form
can be converted to one in the standard formula form as stated
in Proposition 1.


Proposition 1. Every DQBF $\Psi$ in the form of (1) in circuit
form can be converted in polynomial time into an equisatisfiable
DQBF formula $\Psi'$ whose matrix is in DNF. Moreover,
$\Psi$ and $\Psi'$ have the same existential variables (with the same
dependency set).


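The proof is by standard Tseitin’s transformation [37]. As
an example, consider the following DQBF.

$$\forall x_1 \forall x_2 \ \exists y_1(x_1) \exists y_2(x_2) \ \neg(x_2 \vee (y_1 \wedge x_1 \wedge y_2))$$

It is equisatisfiable with the following DQBF.

$$\forall x_1 \forall x_2 \forall u_1 \forall u_2 \forall u_3 \forall v_1 \forall v_2 \ \exists y_1(x_1) \exists y_2(x_2)$$
$$\big( (v_1 \leftrightarrow y_1) \wedge (v_2 \leftrightarrow y_2) \wedge (u_1 \leftrightarrow v_1 \wedge x_1 \wedge v_2) \wedge (u_2 \leftrightarrow x_2 \vee u_1) \wedge (u_3 \leftrightarrow \neg u_2) \big) \to u_3$$

Intuitively, we use the extra variable $v_1$ to represent the value
$y_1$, $v_2$ the value $y_2$, $u_1$ the value $y_1 \wedge x_1 \wedge y_2$, $u_2$ the value
$x_2 \vee (y_1 \wedge x_1 \wedge y_2)$ and $u_3$ the value $\neg(x_2 \vee (y_1 \wedge x_1 \wedge y_2))$.
Note that the matrix can be easily rewritten into DNF.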


III. THE NEXP-COMPLETENESS OF SAT(DQBF)

In this section we present two new proofs that SAT(DQBF)
is NEXP-complete, originally proved in [9].

Theorem 2. [9] SAT(DQBF) is NEXP-complete.

Note that the membership is straightforward. So we will
focus only on the hardness.

A. The first proof: Reduction from succinct 3-colorability


The reduction is from the problem graph 3-colorability
where the input graphs are given in a succinct form [24]. A
(boolean) circuit $C(\bar{u},\bar{v})$, where $|\bar{u}| = |\bar{v}| = n$, represents a
graph $G(C) = (V, E)$ where $V = \Sigma^n$ and $(\bar{a},\bar{b}) \in E$ iff
$C(\bar{a},\bar{b}) = 1$. The problem succinct 3-colorability is defined
as: On input circuit $C$, decide if $G(C)$ is 3-colorable. This
problem is NEXP-complete [27].

The reduction to SAT(DQBF) is as follows. Let $C(\bar{u},\bar{v})$ be
the input circuit, where $|\bar{u}| = |\bar{v}| = n$. We represent a 3-coloring
of $G(C)$ as a function $g : \Sigma^n \to \{01,10,11\}$ which
can be encoded by the following DQBF.

$$\Psi := \forall \bar{x}_1 \forall \bar{x}_2 \ \exists y_1(\bar{x}_1) \exists y_2(\bar{x}_1) \exists y_3(\bar{x}_2) \exists y_4(\bar{x}_2)$$
$$\big(\bar{x}_1 = \bar{x}_2 \to (y_1,y_2) = (y_3,y_4)\big) \qquad (2)$$
$$\wedge \ (y_1,y_2) \neq (0,0) \ \wedge \ (y_3,y_4) \neq (0,0) \qquad (3)$$
$$\wedge \ \big(C(\bar{x}_1,\bar{x}_2) = 1 \to (y_1,y_2) \neq (y_3,y_4)\big) \qquad (4)$$

Intuitively, we use $y_1, y_2$ and $y_3, y_4$ to represent the first and
the second bits of the image $g(\bar{x}_1)$ and $g(\bar{x}_2)$, respectively.
Lines (2) and (3) state that $(y_1,y_2)$ and $(y_3,y_4)$ must represent
the same function from $\Sigma^n$ to $\Sigma^2$ and that their images do not
include 00. Line (4) states that the colors of two adjacent
vertices must be different. Thus, $G(C)$ is 3-colorable iff $\Psi$ is
satisfiable.
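
The equivalence can be checked by brute force on very small graphs. The Python sketch below, assuming the circuit is given as an ordinary Python predicate `edge(a, b)` on bit tuples, evaluates lines (2) to (4) under every choice of Skolem functions and compares the answer with a direct search for a 3-coloring. All names are illustrative.

```python
from itertools import product


def three_colorable(n, edge):
    """Reference check: try every colouring of the 2^n vertices with 3 colours."""
    vertices = list(product((0, 1), repeat=n))
    for colours in product(((0, 1), (1, 0), (1, 1)), repeat=len(vertices)):
        g = dict(zip(vertices, colours))
        if all(not edge(a, b) or g[a] != g[b] for a in vertices for b in vertices):
            return True
    return False


def matrix(x1, x2, y1, y2, y3, y4, edge):
    """Lines (2)-(4) of the DQBF for succinct 3-colorability."""
    same_function = (x1 != x2) or ((y1, y2) == (y3, y4))       # (2)
    valid_colours = (y1, y2) != (0, 0) and (y3, y4) != (0, 0)  # (3)
    proper = (not edge(x1, x2)) or ((y1, y2) != (y3, y4))      # (4)
    return same_function and valid_colours and proper


def dqbf_satisfiable(n, edge):
    """Enumerate Skolem functions: y1, y2 read x1 only; y3, y4 read x2 only."""
    vertices = list(product((0, 1), repeat=n))
    functions = [dict(zip(vertices, bits))
                 for bits in product((0, 1), repeat=len(vertices))]
    for s1, s2, s3, s4 in product(functions, repeat=4):
        if all(matrix(a, b, s1[a], s2[a], s3[b], s4[b], edge)
               for a in vertices for b in vertices):
            return True
    return False


def num(bits):
    return 2 * bits[0] + bits[1]


def cycle(a, b):
    """A 4-cycle on the vertices 00, 01, 10, 11."""
    return (num(a) - num(b)) % 4 in (1, 3)


def complete(a, b):
    """The complete graph on four vertices, which needs four colours."""
    return a != b


for name, graph in (("4-cycle", cycle), ("K4", complete)):
    print(name, three_colorable(2, graph), dqbf_satisfiable(2, graph))
```

The enumeration is only feasible for $n \le 2$; it is meant as a sanity check of the encoding, not as a solving method.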


B. The second proof: Reduction via succinct projections 


Our second proof uses the notion of succinct projection.
We need some terminology. Let $C(\bar{u}_1,\bar{v}_1,\bar{u}_2,\bar{v}_2)$ be a circuit
with input gates $\bar{u}_1,\bar{v}_1,\bar{u}_2,\bar{v}_2$ where $|\bar{u}_1| = |\bar{u}_2| = n$ and
$|\bar{v}_1| = |\bar{v}_2| = m$. We say that a function $g : \Sigma^n \to \Sigma^m$ agrees
with the circuit $C$, if $C(\bar{w}_1, g(\bar{w}_1), \bar{w}_2, g(\bar{w}_2)) = 1$, for every
$\bar{w}_1,\bar{w}_2 \in \Sigma^n$. In this case, we also say that the circuit $C$
describes the function $g$. In the following whenever we say
that a function $g : \Sigma^n \to \Sigma^m$ agrees with $C(\bar{u}_1,\bar{v}_1,\bar{u}_2,\bar{v}_2)$, we
implicitly assume that $n = |\bar{u}_1| = |\bar{u}_2|$ and $m = |\bar{v}_1| = |\bar{v}_2|$.


Definition 3. A succinct projection for a language $L$ is a
polynomial time deterministic algorithm $M$ such that on input
$w \in \Sigma^*$, $M$ outputs a circuit $C$ such that $w \in L$ iff there is
a function $g$ that agrees with $C$.


Intuitively, we can view the function g as the certificate for 
the membership of w in L and the circuit C as the succinct 
description of g. Since succinct projection runs in polynomial 
time, the output circuit can only have polynomially many 
gates. The following theorem is a new characterization of 
languages in NEXP. 


Theorem 4. A language $L \in$ NEXP iff it has a succinct
projection.


Proof. (if) Suppose that $L$ has a succinct projection. Consider
the following algorithm. On input $w$, first use the succinct
projection to construct the circuit $C$. Then, guess a function
$g$ (of exponential size) and verify that it agrees with $C$. It is
obvious that it runs in non-deterministic exponential time. That
it is correct follows from the definition of succinct projection.

(only if) It is essentially the Cook-Levin reduction disguised
in the form of function certificates. We only sketch it here.
Let $L \in$ NEXP and $M$ be a 1-tape NTM that accepts $L$ in
time $2^{p(n)}$ for some polynomial $p(n)$. For a word $w \in L$ of
length $n$, its accepting run can be represented as a function
$g : \Sigma^{p(n)} \times \Sigma^{p(n)} \to \Sigma^{\ell}$ where $g(i,j)$ denotes the content of cell $i$
in time $j$. The tuples in the codomain $\Sigma^{\ell}$ encode the states and
the tape symbols of $M$. To verify that $g$ represents an accepting
run, it is sufficient to verify that for every $i_1, j_1, i_2, j_2 \in \Sigma^{p(n)}$,
the tuple $(i_1, j_1, g(i_1,j_1), i_2, j_2, g(i_2,j_2))$ satisfies a certain
property $P$ which depends only on the input word $w$ and the
transitions of $M$. The desired succinct projection constructs in
polynomial time a circuit $C$ describing this property $P$.


The second proof of the NEXP-hardness of SAT(DQBF): Let
$L \in$ NEXP. The polynomial time (Karp) reduction from $L$ to
SAT(DQBF) is described as Algorithm 1 below.


Algorithm 1: Reducing $L \in$ NEXP to SAT(DQBF)

Input: $w \in \Sigma^*$.

1: Run the succinct projection of $L$ on $w$.

2: Let $C(\bar{x}_1,\bar{y}_1,\bar{x}_2,\bar{y}_2)$ be the output circuit where
$|\bar{x}_1| = |\bar{x}_2| = n$, $|\bar{y}_1| = |\bar{y}_2| = m$, $\bar{y}_1 = (y_{1,1},\ldots,y_{1,m})$
and $\bar{y}_2 = (y_{2,1},\ldots,y_{2,m})$.

3: Output the following DQBF $\Psi$:

$$\forall \bar{x}_1 \forall \bar{x}_2 \ \exists y_{1,1}(\bar{x}_1) \cdots \exists y_{1,m}(\bar{x}_1) \ \exists y_{2,1}(\bar{x}_2) \cdots \exists y_{2,m}(\bar{x}_2)$$
$$C(\bar{x}_1,\bar{y}_1,\bar{x}_2,\bar{y}_2) \ \wedge \ (\bar{x}_1 = \bar{x}_2 \to \bar{y}_1 = \bar{y}_2)$$


We show $w \in L$ iff $\Psi$ is satisfiable. Suppose $w \in L$. Let
$g : \Sigma^n \to \Sigma^m$ be a function that agrees with $C$. For each
$1 \le i \le m$, define the Skolem function $s_i : \Sigma^n \to \Sigma$ where
$s_i(\bar{a})$ is the $i$-th component of $g(\bar{a})$, for every $\bar{a} \in \Sigma^n$. It is
routine to verify that $\Psi$ is satisfiable with each $s_i$ being the
Skolem function for $y_{1,i}$ and $y_{2,i}$.

Conversely, suppose $\Psi$ is satisfiable. Let $s_{j,i} : \Sigma^n \to \Sigma$ be
the Skolem function for $y_{j,i}$, where $1 \le j \le 2$ and $1 \le i \le m$.
Since $\bar{x}_1 = \bar{x}_2 \to \bar{y}_1 = \bar{y}_2$, the functions $s_{1,i}$ and $s_{2,i}$ must
be the same, for every $1 \le i \le m$. Define $g : \Sigma^n \to \Sigma^m$
where $g(\bar{a}) = (s_{1,1}(\bar{a}),\ldots,s_{1,m}(\bar{a}))$ for every $\bar{a} \in \Sigma^n$. Since
$C(\bar{a}_1, g(\bar{a}_1), \bar{a}_2, g(\bar{a}_2))$ is true for every $\bar{a}_1, \bar{a}_2$, the function $g$
agrees with $C$. That is, there is a function that agrees with $C$.
Hence, $w \in L$. This completes the second proof.
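
The correctness argument can be exercised on toy circuits. In the Python sketch below, a circuit is modeled as a Python predicate on bit tuples, and the $m$ Skolem functions for $y_{j,1},\ldots,y_{j,m}$ are bundled into one function $s_j : \Sigma^n \to \Sigma^m$, which is equivalent for the purpose of this check. The two sample circuits are invented for illustration and do not come from any succinct projection in the paper.

```python
from itertools import product


def functions(n, m):
    """Yield every g : Sigma^n -> Sigma^m as a dict on bit tuples."""
    domain = list(product((0, 1), repeat=n))
    codomain = list(product((0, 1), repeat=m))
    for images in product(codomain, repeat=len(domain)):
        yield dict(zip(domain, images))


def agrees(g, circuit, n):
    """g agrees with C if C(w1, g(w1), w2, g(w2)) = 1 for all w1, w2."""
    domain = list(product((0, 1), repeat=n))
    return all(circuit(w1, g[w1], w2, g[w2]) for w1 in domain for w2 in domain)


def algorithm1_matrix(circuit):
    """Matrix output by Algorithm 1: C(x1,y1,x2,y2) and (x1 = x2 -> y1 = y2)."""
    return lambda x1, y1, x2, y2: circuit(x1, y1, x2, y2) and (x1 != x2 or y1 == y2)


def dqbf_satisfiable(matrix, n, m):
    """Search for Skolem functions s1 (reading x1) and s2 (reading x2)."""
    domain = list(product((0, 1), repeat=n))
    candidates = list(functions(n, m))
    return any(
        all(matrix(x1, s1[x1], x2, s2[x2]) for x1 in domain for x2 in domain)
        for s1 in candidates for s2 in candidates
    )


def parity_circuit(w1, v1, w2, v2):
    """Describes exactly one function: the parity of the two input bits."""
    return v1 == (w1[0] ^ w1[1],) and v2 == (w2[0] ^ w2[1],)


def contradictory_circuit(w1, v1, w2, v2):
    """Demands g(w) != g(w), so no function can agree with it."""
    return w1 != w2 or v1 != v2


for circuit in (parity_circuit, contradictory_circuit):
    some_g_agrees = any(agrees(g, circuit, 2) for g in functions(2, 1))
    print(circuit.__name__, some_g_agrees, dqbf_satisfiable(algorithm1_matrix(circuit), 2, 1))
```

In both cases the two answers coincide, as the proof predicts.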


Remark 5. Observe that when Theorem 4 is applied to
languages in NP, the accepting run of a non-deterministic
Turing machine with polynomial run time $p(n)$ is represented
as a function $g : \Sigma^{\log p(n)} \times \Sigma^{\log p(n)} \to \Sigma^{\ell}$ and the
succinct projection outputs a circuit $C(\bar{x}_1,\bar{y}_1,\bar{x}_2,\bar{y}_2)$ where
$|\bar{x}_1| = |\bar{x}_2| = \log p(n)$ and $|\bar{y}_1| = |\bar{y}_2| = \ell$. Thus, for
$L \in$ NP, the DQBF output by Algorithm 1 has $4 \log p(n)$
universal variables and $2\ell$ existential variables.


IV. SOME CONCRETE REDUCTIONS 


In this section we show how to utilize succinct projection to 
obtain the reductions from concrete NEXP-complete problems 
to SAT(DQBF). These are (succinct) Hamiltonian cycle, set 
packing and subset sum [27]. We use the notion of succinctness 
from [24] which has been explained in Sect. III-A. By Algorithm 1,
it suffices to present only the succinct projections.


Some useful notations: For an integer $k \ge 1$, $[k]$ denotes
the set $\{0,\ldots,k-1\}$. For $i \in [2^n]$, $\mathrm{bin}_n(i)$ is the binary
representation of $i$ in $n$ bits. The number represented by $\bar{a} \in \Sigma^n$
is denoted by $\mathrm{num}(\bar{a})$. For $\bar{a},\bar{b} \in \Sigma^n$, if $\mathrm{num}(\bar{a}) = \mathrm{num}(\bar{b}) + 1 \pmod{2^n}$,
we say that $\bar{a}$ is the successor of $\bar{b}$, denoted by
$\bar{a} = \bar{b} + 1$. Note that successor is applied only on two strings
with the same length and the successor of $1^n$ is $0^n$. It is not
difficult to construct a circuit $C(\bar{x},\bar{y})$ (in time polynomial in
$|\bar{x}| + |\bar{y}|$) such that $C(\bar{a},\bar{b}) = 1$ iff $\bar{a} = \bar{b} + 1$.
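
One way to see that the successor circuit is small is to write it as a ripple-carry incrementer, which uses a constant number of gates per bit. The Python sketch below follows that idea; the most-significant-bit-first ordering is an assumption made here, since the text does not fix a bit order.

```python
def num(bits):
    """Number represented by a bit tuple, most significant bit first."""
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def bin_n(i, n):
    """Binary representation of i in n bits, most significant bit first."""
    return tuple((i >> (n - 1 - k)) & 1 for k in range(n))


def successor_circuit(a, b):
    """Return 1 iff a = b + 1 (mod 2^n), computed bit by bit."""
    assert len(a) == len(b)
    carry = 1                                # the "+ 1" enters as a carry
    for a_bit, b_bit in zip(reversed(a), reversed(b)):
        if a_bit != (b_bit ^ carry):         # sum bit of b + carry must equal a
            return 0
        carry = b_bit & carry                # carry into the next position
    return 1                                 # the final carry is dropped (mod 2^n)


print(successor_circuit(bin_n(5, 3), bin_n(4, 3)))
print(successor_circuit((0, 0, 0), (1, 1, 1)))
print(successor_circuit(bin_n(4, 3), bin_n(5, 3)))
```

Each loop iteration corresponds to one XOR, one AND and one comparison gate, so the circuit size is linear in $n$.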

Reduction from succinct Hamiltonian cycle: Succinct
Hamiltonian cycle is defined as follows. The input is a circuit
$C(\bar{u},\bar{v})$. The task is to decide if there is a Hamiltonian cycle
in $G(C)$.

Let $C(\bar{u},\bar{v})$ be the input circuit where $|\bar{u}| = |\bar{v}| = n$. We
use a function $g : \Sigma^n \to \Sigma^n$ to represent a Hamiltonian cycle
$(\bar{b}_0,\ldots,\bar{b}_{2^n-1})$ where $g(\mathrm{bin}_n(i)) = \bar{b}_i$, for every $i \in [2^n]$. To
correctly represent a Hamiltonian cycle, the following must
hold for every $\bar{a}_1, \bar{a}_2 \in \Sigma^n$.

(H1) If $\bar{a}_1 \neq \bar{a}_2$, then $g(\bar{a}_1) \neq g(\bar{a}_2)$.

(H2) If $\bar{a}_2 = \bar{a}_1 + 1$, then $(g(\bar{a}_1), g(\bar{a}_2))$ is an edge in $G(C)$.

The succinct projection for succinct Hamiltonian cycle simply
outputs the circuit that expresses (H1) and (H2), i.e., it outputs
the following circuit $D(\bar{x}_1,\bar{y}_1,\bar{x}_2,\bar{y}_2)$ where $|\bar{x}_1| = |\bar{x}_2| =
|\bar{y}_1| = |\bar{y}_2| = n$:

$$(\bar{x}_1 \neq \bar{x}_2 \to \bar{y}_1 \neq \bar{y}_2) \ \wedge \ (\bar{x}_2 = \bar{x}_1 + 1 \to C(\bar{y}_1,\bar{y}_2) = 1)$$

Obviously, a function $g : \Sigma^n \to \Sigma^n$ represents a Hamiltonian
cycle in $G(C)$ iff it agrees with $D$.
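
The following Python sketch checks this claim by brute force on graphs with four vertices ($n = 2$). The input circuit is modeled as a Python predicate `edge(a, b)` on bit tuples, edges are treated as directed as in the definition of $G(C)$, and all function names are illustrative.

```python
from itertools import permutations, product


def num(bits):
    value = 0
    for bit in bits:
        value = 2 * value + bit
    return value


def is_successor(a, b):
    """True iff a = b + 1 (mod 2^n)."""
    return num(a) == (num(b) + 1) % (2 ** len(b))


def projection_circuit(edge):
    """The circuit D expressing (H1) and (H2)."""
    def D(x1, y1, x2, y2):
        h1 = (x1 == x2) or (y1 != y2)
        h2 = (not is_successor(x2, x1)) or edge(y1, y2)
        return h1 and h2
    return D


def some_function_agrees(D, n):
    """Search all g : Sigma^n -> Sigma^n for one that agrees with D."""
    domain = list(product((0, 1), repeat=n))
    for images in product(domain, repeat=len(domain)):
        g = dict(zip(domain, images))
        if all(D(a1, g[a1], a2, g[a2]) for a1 in domain for a2 in domain):
            return True
    return False


def has_hamiltonian_cycle(edge, n):
    """Reference check: try every cyclic ordering of the vertices."""
    vertices = list(product((0, 1), repeat=n))
    return any(
        all(edge(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))
        for order in permutations(vertices)
    )


def directed_cycle(a, b):
    """Edges 0 -> 1 -> 2 -> 3 -> 0."""
    return num(b) == (num(a) + 1) % 4


def directed_path(a, b):
    """Edges 0 -> 1 -> 2 -> 3 only, so the cycle cannot be closed."""
    return num(b) == num(a) + 1


for graph in (directed_cycle, directed_path):
    print(graph.__name__,
          has_hamiltonian_cycle(graph, 2),
          some_function_agrees(projection_circuit(graph), 2))
```

Because the successor relation wraps around from $1^n$ to $0^n$, condition (H2) also covers the edge that closes the cycle.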

Reduction from succinct set packing: In the standard
representation the problem set packing is defined as follows.
The input is a collection $K$ of finite sets $S_1,\ldots,S_\ell \subseteq \Sigma^m$
and an integer $k$. The task is to decide whether $K$ contains
$k$ mutually disjoint sets. We assume each $S_i$ has a “name”
which is a string in $\Sigma^{\log \ell}$.

The succinct representation of the sets $S_1,\ldots,S_\ell$ is a circuit
$C(\bar{u},\bar{v})$ where $|\bar{u}| = m$ and $|\bar{v}| = \log \ell$. A string $\bar{a} \in \Sigma^m$ is in
the set $S_{\bar{b}}$, if $C(\bar{a},\bar{b}) = 1$. We denote by $K(C)$ the collection
of finite sets defined by the circuit $C$. The problem succinct set
packing is defined analogously where the input is the circuit
$C(\bar{u},\bar{v})$ and an integer $k$ (in binary).

We now describe its succinct projection. Let $C(\bar{u},\bar{v})$ and $k$
be the input where $|\bar{u}| = m$ and $|\bar{v}| = n$. We first assume that
$k$ is a power of 2. We represent $k$ disjoint sets $S_1,\ldots,S_k$ in
$K(C)$ as a function $g : \Sigma^{\log k} \times \Sigma^m \to \Sigma^n$ where $g(\mathrm{bin}(i), \bar{a})$
is the name of the set $S_i$. Note that the string $\bar{a}$ is actually
ignored in the definition of $g$.

For a function $g : \Sigma^{\log k} \times \Sigma^m \to \Sigma^n$ to correctly
represent $k$ disjoint sets, the following must hold for every
$(\bar{a}_1,\bar{b}_1), (\bar{a}_2,\bar{b}_2) \in \Sigma^{\log k} \times \Sigma^m$.

(P1) If $\bar{a}_1 = \bar{a}_2$, then $g(\bar{a}_1,\bar{b}_1) = g(\bar{a}_2,\bar{b}_2)$. That is, the
function $g$ does not depend on $\bar{b}_1$ and $\bar{b}_2$.

(P2) If $\bar{a}_1 \neq \bar{a}_2$ and $\bar{b}_1 = \bar{b}_2$, then $C(\bar{b}_1, g(\bar{a}_1,\bar{b}_1)) = 0$ or
$C(\bar{b}_1, g(\bar{a}_2,\bar{b}_2)) = 0$. That is, the element $\bar{b}_1$ is not in
both of the sets whose names are $g(\bar{a}_1,\bar{b}_1)$ and $g(\bar{a}_2,\bar{b}_2)$.

It is routine to verify that $g$ represents $k$ disjoint sets iff
(P1) and (P2) hold for every $(\bar{a}_1,\bar{b}_1), (\bar{a}_2,\bar{b}_2) \in \Sigma^{\log k} \times \Sigma^m$.
The succinct projection outputs the following circuit $D$ that
formalizes (P1) and (P2):

$$(\bar{x}_1 = \bar{x}_2 \to \bar{z}_1 = \bar{z}_2) \ \wedge \ \big((\bar{x}_1 \neq \bar{x}_2 \wedge \bar{y}_1 = \bar{y}_2) \to \neg(C(\bar{y}_1,\bar{z}_1) = C(\bar{y}_1,\bar{z}_2) = 1)\big)$$

If $k$ is not a power of 2, we conjunct both atoms $\bar{x}_1 = \bar{x}_2$
and $\bar{x}_1 \neq \bar{x}_2$ with a circuit that tests whether the numbers
represented by the bits $\bar{x}_1$ and $\bar{x}_2$ are integers in $[k]$. Such a
circuit can be easily constructed in polynomial time in $\lceil \log k \rceil$.
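
As a sanity check of (P1) and (P2), the Python sketch below searches for a function that agrees with $D$ on a tiny instance and compares the result with a direct search for $k$ pairwise disjoint sets. The instance sizes ($m = 1$, $n = 2$, $k = 2$), the sample collections and the argument order of `D` are choices made here for illustration. The sample collections contain no empty set, so that any agreeing function necessarily picks distinct names.

```python
from itertools import combinations, product


def projection_circuit(C):
    """D formalizing (P1) and (P2); z_i plays the role of g(x_i, y_i)."""
    def D(x1, y1, z1, x2, y2, z2):
        p1 = (x1 != x2) or (z1 == z2)
        p2 = not (x1 != x2 and y1 == y2) or not (C(y1, z1) and C(y1, z2))
        return p1 and p2
    return D


def some_function_agrees(D, log_k, m, n):
    """Search all g : Sigma^{log k} x Sigma^m -> Sigma^n."""
    domain = list(product(product((0, 1), repeat=log_k), product((0, 1), repeat=m)))
    names = list(product((0, 1), repeat=n))
    for images in product(names, repeat=len(domain)):
        g = dict(zip(domain, images))
        if all(D(a1, b1, g[(a1, b1)], a2, b2, g[(a2, b2)])
               for (a1, b1) in domain for (a2, b2) in domain):
            return True
    return False


def has_packing(C, k, m, n):
    """Reference check: k distinct names whose sets are pairwise disjoint."""
    elements = list(product((0, 1), repeat=m))
    names = list(product((0, 1), repeat=n))
    return any(
        all(not (C(e, s) and C(e, t))
            for s, t in combinations(chosen, 2) for e in elements)
        for chosen in combinations(names, k)
    )


def circuit_from(sets):
    """Model the input circuit C(element, name) by table lookup."""
    return lambda element, name: element in sets[name]


packable = {(0, 0): {(0,)}, (0, 1): {(1,)}, (1, 0): {(0,), (1,)}, (1, 1): {(0,), (1,)}}
overlapping = {(0, 0): {(0,)}, (0, 1): {(0,), (1,)}, (1, 0): {(0,)}, (1, 1): {(0,), (1,)}}

for sets in (packable, overlapping):
    C = circuit_from(sets)
    print(has_packing(C, 2, 1, 2), some_function_agrees(projection_circuit(C), 1, 1, 2))
```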

Reduction from succinct subset-sum: In the standard
representation the instance of subset-sum is a list of positive
integers $s_0,\ldots,s_{k-1}$ and $t$ (all written in binary). The task is
to decide if there is a subset $X \subseteq [k]$ such that $\sum_{i \in X} s_i = t$.
Such $X$ is called the subset-sum solution. The succinct
representation is defined as two circuits $C_1(\bar{u}_1,\bar{v})$ and $C_2(\bar{u}_2)$,
where $|\bar{u}_1| = \max_{i \in [k]} \log s_i$, $|\bar{v}| = \log k$ and $|\bar{u}_2| = \log t$.
Circuit $C_1$ defines the numbers $s_i$’s where $C_1(\bar{a},\bar{b})$ is the $i$-th
least significant bit of $s_j$, where $i = \mathrm{num}(\bar{a})$ and $j = \mathrm{num}(\bar{b})$.
Circuit $C_2$ defines the number $t$ where $C_2(\bar{a})$ is the $i$-th
least significant bit of $t$, where $i = \mathrm{num}(\bar{a})$. The subset-sum
instance represented by $C_1$ and $C_2$ is denoted by $N(C_1, C_2)$.
We will describe the succinct projection for succinct subset-sum.

Let $C_1(\bar{u}_1,\bar{v})$ and $C_2(\bar{u}_2)$ be the input where $|\bar{u}_1| = |\bar{u}_2| = n$
and $|\bar{v}| = m$. We need a few notations. Let $s_0,\ldots,s_{2^m-1}$ be
the numbers represented by $C_1$ and $t$ the number represented
by $C_2$. For a set $X \subseteq [2^m]$, let $T_X = \sum_{i \in X} s_i$. For $0 \le j \le 2^m$,
let $T_{X,j} = T_{X \cap [j]}$. Abusing the notation, for $\bar{b} \in \Sigma^m$, we
write $s_{\bar{b}}$ and $T_{X,\bar{b}}$ to denote $s_i$ and $T_{X,i}$, respectively, where
$i = \mathrm{num}(\bar{b})$. For $\bar{a} \in \Sigma^n$, bit-$\bar{a}$ means bit-$i$ where $i = \mathrm{num}(\bar{a})$.

We represent a set $X \subseteq [2^m]$ as a function $g : \Sigma^n \times \Sigma^m \to \Sigma^5$
where $g(\bar{a},\bar{b}) = (\alpha, \beta, \gamma, \delta, \epsilon)$ such that:

• $\alpha = 1$ iff $s_{\bar{b}} \in X$.

• $\beta$ is bit-$\bar{a}$ in $T_{X,\bar{b}}$.

• $\gamma$ is the carry of adding $T_{X,\bar{b}}$ and $s_{\bar{b}}$ up to bit-$(\bar{a} - 1)$.

• $\delta\epsilon = \beta + \gamma + C_1(\bar{a},\bar{b})$, i.e., $\epsilon$ is the least significant bit of
$\beta + \gamma + C_1(\bar{a},\bar{b})$ and $\delta$ is the carry.
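See the illustration below.

[Illustration: the addition of $T_{X,\bar{b}}$ and $s_{\bar{b}}$ at bit-$\bar{a}$. The bit $\beta$ (bit-$\bar{a}$ in $T_{X,\bar{b}}$), the bit $C_1(\bar{a},\bar{b})$ of $s_{\bar{b}}$ and the carry $\gamma$ coming from bit-0 to bit-$(\bar{a} - 1)$ are added, giving the result bit $\epsilon$ and the outgoing carry $\delta$.]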

Intuitively, g(ā, b) contains the information about the additions 
performed on bit-a@ in s (with respect to the set X). In 
particular, the bits of the number Tx are all contained in 
g(G@,1™) for every @ € X”. These bits can then be compared 
to those in t by means of the circuit C. 

Note that for a function g : X” x E™ — Ð’ to properly 
represent a number Ty, for some X C [2™], it suffices to 
check the values of g on “neighbouring” points in X” x X™. 
More precisely, the following conditions must be satisfied 
for every (G1, b1), (G2,b2) € E” x E™, where g(a1,b1) = 
(a1, 61,71, 61, €1) and g(G2,b2) = (a2, Be, Y2, b2, €2). 


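(i) If $\bar{b}_1 = \bar{b}_2$, then $\alpha_1 = \alpha_2$. That is, the value $\alpha_1$ depends
only on the index of a number.
(ii) If $\alpha_1 = 0$, then $\gamma_1 = \delta_1 = 0$ and $\beta_1 = \epsilon_1$.
(iii) If $\alpha_1 = 1$, then $\gamma_1 + C_1(\bar{a}_1, \bar{b}_1) + \beta_1 = \delta_1\epsilon_1$.
(iv) If $\bar{a}_1 = 0^n$, then $\gamma_1 = 0$.
(v) If $\bar{a}_1 = 1^n$, then $\delta_1 = 0$.
(vi) If $\bar{b}_1 = 0^m$, then $\beta_1 = \gamma_1 = 0$.
(vii) If $\bar{b}_1 = 1^m$, then $\epsilon_1 = C_2(\bar{a}_1)$.
(viii) If $\alpha_1 = 1$ and $\bar{b}_1 = \bar{b}_2$ and $\bar{a}_2 = \bar{a}_1 + 1$, then $\delta_1 = \gamma_2$.
(ix) If $\alpha_1 = 1$ and $\bar{b}_2 = \bar{b}_1 + 1$ and $\bar{a}_2 = \bar{a}_1$, then $\epsilon_1 = \beta_2$.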


Intuitively, (ii) and (iii) state that the values of
$(\alpha_1, \beta_1, \gamma_1, \delta_1, \epsilon_1)$ must have their intended meaning,
i.e., when $\alpha_1 = 0$, no addition is performed and when
$\alpha_1 = 1$, the addition $\gamma_1 + C_1(\bar{a}_1, \bar{b}_1) + \beta_1$ is performed and
the result is $\delta_1\epsilon_1$. (iv) states that there is no carry from the
previous bit when considering the least significant bit. (v)
states that there should not be any carry after adding the most
significant bit (if we want $T_X$ to equal $t$). (vi) states that $T_{X,0}$
must be zero. (vii) states that bit-$\bar{a}$ in $T_X$ must equal
bit-$\bar{a}$ in $t$. Finally, (viii) and (ix) state that when $(\bar{a}_1, \bar{b}_1)$ and
$(\bar{a}_2, \bar{b}_2)$ are neighbors, the bits $\beta_1, \gamma_1, \delta_1, \epsilon_1$ and $\beta_2, \gamma_2, \delta_2, \epsilon_2$
must obey their intended meaning.

Obviously, if $g$ satisfies (i)–(ix), then it represents a set $X$
such that $T_X = t$. Conversely, if there is a set $X$ such that
$T_X = t$, then there is a function $g$ that satisfies (i)–(ix). It is
not difficult to design a succinct projection that constructs a
circuit $D$ that describes functions that satisfy (i)–(ix).
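The encoding above can be made concrete on a tiny explicit instance. The following Python sketch is only an illustration under stated assumptions: the numbers $s_i$ are given as an explicit list rather than by a circuit $C_1$, indices $\bar{a}$ and $\bar{b}$ are replaced by the integers they denote, and the helper names (`bit`, `encode`, `decode_sum`) are hypothetical. It builds the tuple $(\alpha, \beta, \gamma, \delta, \epsilon)$ for every pair and reads $T_X$ back from the $\epsilon$ components at the last index.

```python
def bit(x, i):
    """Return bit-i of the non-negative integer x (bit-0 is least significant)."""
    return (x >> i) & 1


def encode(s, X, n):
    """Build g as a dict (a, b) -> (alpha, beta, gamma, delta, eps).

    s : explicit list of n-bit numbers s_0, ..., s_{2^m - 1} (stand-in for C_1)
    X : set of chosen indices
    n : number of bits per number
    """
    g = {}
    prefix = 0  # T_{X,b}: sum of the chosen numbers with index below b
    for b, s_b in enumerate(s):
        alpha = 1 if b in X else 0
        carry = 0
        for a in range(n):
            beta = bit(prefix, a)
            if alpha == 0:
                # No addition is performed: the bit is copied unchanged.
                gamma, delta, eps = 0, 0, beta
            else:
                gamma = carry
                total = beta + gamma + bit(s_b, a)
                delta, eps = total >> 1, total & 1
                carry = delta
            g[(a, b)] = (alpha, beta, gamma, delta, eps)
        if alpha:
            prefix += s_b
    return g


def decode_sum(g, n, last_b):
    """Read T_X from the eps components at the last index (all-ones b)."""
    return sum(g[(a, last_b)][4] << a for a in range(n))


s = [3, 5, 6, 1]
g = encode(s, {1, 3}, n=4)
print(decode_sum(g, 4, len(s) - 1))  # 5 + 1 = 6
```

The sketch assumes that $T_X$ fits in $n$ bits, which corresponds to condition (v).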


V. REDUCTIONS FROM OTHER NEXP-COMPLETE LOGICS 
In this section we will consider the following fragments of 
relational first-order logic (with the equality predicate): 


• The Bernays-Schönfinkel-Ramsey (BSR) class: The class
of sentences of the form:

$$\Psi_1 := \exists x_1 \cdots \exists x_m \, \forall y_1 \cdots \forall y_n \, \psi$$

where $\psi$ is a quantifier-free formula.

• The two-variable logic (FO$^2$): The class of sentences
using only two variables $x$ and $y$.
The classic result by Scott [38] states that every FO$^2$
sentence can be transformed in linear time into an
equisatisfiable FO$^2$ sentence of the form:

$$\Psi_2 := \forall x \forall y \, \alpha(x, y) \wedge \bigwedge_{i=1}^{m} \forall x \exists y \, \beta_i(x, y)$$

for some $m \ge 1$, where $\alpha(x, y)$ and each $\beta_i(x, y)$ are
quantifier-free formulas.

• The Löwenheim/monadic class: The class of sentences
using only unary predicate symbols. Sentences in this
class are also known as monadic sentences.


Let SAT(BSR), SAT(Mon) and SAT(FO$^2$) denote their
corresponding satisfiability problems. It is well known that all
of them are NEXP-complete [28]–[32]. The upper bound is
usually established by the so-called Exponential Size Model
(ESM) property, stated as follows.

• If the BSR sentence $\Psi_1$ is satisfiable, then it is satisfiable
by a model with size at most $m + 1$ [31, Prop. 6.2.17].

• If the FO$^2$ sentence $\Psi_2$ is satisfiable, then it is satisfiable
by a model with size $m2^n$, where $n$ is the number of
unary predicates used [30].

• If a Löwenheim sentence is satisfiable, then it is
satisfiable by a model with size at most $r2^n$, where $r$
is the quantifier rank and $n$ is the number of unary
predicates [31, Prop. 6.2.1].

The main idea of the reduction to SAT(DQBF) is quite
simple. We will represent the domain of a model with size
at most $N$ as a subset of $\Sigma^t$, where $t = \log N$, and use a
function $f_0 : \Sigma^t \to \Sigma$ as the indicator of whether an element
is in the domain. Every predicate in the input formula can
be represented as a function $f : \Sigma^{kt} \to \Sigma$ where $k$ is
the arity of the predicate. All these functions can then be
encoded appropriately as existential variables in DQBF. Note
that the universal FO quantifier $\forall x \cdots$ can be encoded as
$\forall \bar{u} \; f_0(\bar{u}) \to \cdots$. The existential FO quantifier can first be
Skolemized and then encoded as existential variables in DQBF.
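To illustrate the encoding idea, the following Python sketch represents a small model as Boolean functions over bit vectors. Everything concrete here is an assumption made for illustration: the three-element domain, the binary predicate `edges`, and the names `f0`, `fP`, `num`, and `to_bits`. It checks the sentence $\forall x \exists y\, P(x, y)$ with both quantifiers relativized to the indicator function, and shows why the guard is needed.

```python
from itertools import product
from math import ceil, log2

N = 3                       # bound on the model size (illustrative)
t = max(1, ceil(log2(N)))   # number of bits per element
domain = {0, 1, 2}          # hypothetical model domain
edges = {(0, 1), (1, 2), (2, 0)}  # hypothetical binary predicate P


def to_bits(i):
    """Little-endian bit vector of length t for the integer i."""
    return tuple((i >> k) & 1 for k in range(t))


def num(bits):
    """Integer denoted by a little-endian bit vector."""
    return sum(b << k for k, b in enumerate(bits))


def f0(u):
    """Indicator f_0 : Sigma^t -> Sigma of the domain."""
    return 1 if num(u) in domain else 0


def fP(u, v):
    """Function f_P : Sigma^{2t} -> Sigma representing the predicate P."""
    return 1 if (num(u), num(v)) in edges else 0


all_vectors = list(product((0, 1), repeat=t))

# forall x exists y P(x, y), with both quantifiers guarded by f_0.
guarded = all(
    any(f0(v) and fP(u, v) for v in all_vectors)
    for u in all_vectors
    if f0(u)
)

# Without the guard, the bit vector for 3 (outside the domain) breaks the sentence.
unguarded = all(any(fP(u, v) for v in all_vectors) for u in all_vectors)

print(guarded, unguarded)
```

In the actual reduction the functions are existentially quantified rather than fixed; the sketch only evaluates one candidate interpretation.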

The rest of this section is organized as follows. For technical
convenience, we first introduce the logic Existential Second-order
Quantified Boolean Formula (∃SOQBF), an alternative
but equivalent formalism of DQBF. The only difference
between ∃SOQBF and DQBF is the syntax for declaring the
function symbols. Then, we consider the problem that we call
Bounded FO satisfiability, denoted by Bnd-SAT(FO), which
subsumes all of SAT(BSR), SAT(FO$^2$) and SAT(Mon), and show
how to reduce it to SAT(∃SOQBF).

The logic ∃SOQBF: The class ∃SOQBF is the extension
of quantified boolean formulas (QBF) with existential second-order
quantifiers, i.e., formulas of the form:

$$\Psi := \exists f_1 \exists f_2 \cdots \exists f_p \; Q_1 v_1 \cdots Q_n v_n \; \psi$$

where each $Q_i \in \{\forall, \exists\}$ and each $f_i$ is a boolean function
symbol associated with a fixed arity $\mathrm{ar}(f_i)$. The formula $\psi$ is
a boolean formula using the variables $v_i$'s and $f(\bar{z})$'s, where
$f \in \{f_1, \ldots, f_p\}$, $|\bar{z}| = \mathrm{ar}(f)$ and $\bar{z} \subseteq \{v_1, \ldots, v_n\}$. We call
each $f(\bar{z})$ in $\psi$ a function variable.

The semantics of $\Psi$ is defined naturally. We say that $\Psi$ is
satisfiable if there is an interpretation $F_i : \Sigma^{\mathrm{ar}(f_i)} \to \Sigma$ for
each $f_i$ such that $Q_1 v_1 \cdots Q_n v_n \, \psi$ is a true QBF. In this case
we say that $F_1, \ldots, F_p$ make $\Psi$ true. It is not difficult to see
that DQBF and ∃SOQBF can be transformed to each other in
linear time while preserving satisfiability.

Bounded FO satisfiability (Bnd-SAT(FO)): The problem
Bnd-SAT(FO) is defined as: On input a relational FO sentence $\varphi$
and a positive integer $N$ (in binary), decide if $\varphi$ has a model
with cardinality at most $N$. It is folklore that Bnd-SAT(FO)
is NEXP-complete. Note that due to the ESM property,
Bnd-SAT(FO) trivially subsumes all of SAT(BSR), SAT(FO$^2$) and
SAT(Mon).

Reduction from Bnd-SAT(FO) to SAT(∃SOQBF): Let $\varphi$
and $N$ be the input to Bnd-SAT(FO). We may assume that $\varphi$
is in prenex normal form: $\varphi := Q_1 x_1 \cdots Q_n x_n \, \psi$, where
each $Q_i \in \{\forall, \exists\}$ and $\psi$ is a quantifier-free formula. Adding
a redundant quantifier, if necessary, we assume that $Q_1$ is $\forall$.
Then, we Skolemize each existential quantifier as follows. Let
$i$ be the minimal index where $Q_i = \exists$. We rewrite $\varphi$ into:

$$\varphi' := \forall x_1 \cdots \forall x_{i-1} \; Q_{i+1} x_{i+1} \cdots Q_n x_n \; \forall z \;\; \big( z = g(x_1, \ldots, x_{i-1}) \big) \to \psi'$$

where $z$ is a fresh variable, $g$ is the Skolem function representing
the existentially quantified variable $x_i$ and $\psi'$ is obtained
from $\psi$ by replacing every occurrence of $x_i$ with $z$. Hence,
we may assume that the input sentence $\varphi$ is of the form:

$$\varphi := \forall x_1 \cdots \forall x_n \, \psi \qquad (5)$$

where $\psi$ is a quantifier-free formula in which every (Skolem)
function symbol $g(x_1, \ldots, x_{i-1})$ only occurs in the equality
predicate $z = g(x_1, \ldots, x_{i-1})$ and $z$ is one of $x_1, \ldots, x_n$.

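Let $g_1, \ldots, g_k$ be the Skolem function symbols in $\psi$ and
$P_1, \ldots, P_\ell$ be the predicates in $\psi$. Let $\mathrm{ar}(g_i)$ and $\mathrm{ar}(P_i)$ denote
the arity of $g_i$ and $P_i$. Let $t = \lceil \log N \rceil$. Construct the following
∃SOQBF formula:

$$\Phi := \exists f_0 \, \exists f_{1,1} \cdots \exists f_{1,t} \cdots \exists f_{k,1} \cdots \exists f_{k,t} \, \exists f_{P_1} \cdots \exists f_{P_\ell} \; \forall \bar{u}_1 \cdots \forall \bar{u}_n \;\; \big( \bar{u}_1 = 0^t \to f_0(\bar{u}_1) \big) \wedge \Big( \big( \bigwedge_{i=1}^{n} f_0(\bar{u}_i) \big) \to \Psi \Big) \qquad (6)$$

where:

• The arity of $f_0$ is $t$.

• For every $1 \le i \le k$, the arity of $f_{i,1}, \ldots, f_{i,t}$ is $t \cdot \mathrm{ar}(g_i)$.

• For every $1 \le i \le \ell$, the arity of $f_{P_i}$ is $t \cdot \mathrm{ar}(P_i)$.

• For every $1 \le i \le n$, $|\bar{u}_i| = t$.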

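The formula $\Psi$ is obtained from $\psi$ as follows.

• Each predicate $P_i(x_{j_1}, \ldots, x_{j_m})$ is replaced with
$f_{P_i}(\bar{u}_{j_1}, \ldots, \bar{u}_{j_m})$.

• Each predicate $x_j = g_i(x_{j_1}, \ldots, x_{j_m})$ is replaced with
$\bar{u}_j = \big( f_{i,1}(\bar{u}_{j_1}, \ldots, \bar{u}_{j_m}), \ldots, f_{i,t}(\bar{u}_{j_1}, \ldots, \bar{u}_{j_m}) \big)$.

• Each predicate $x_j = x_{j'}$ is replaced with $\bar{u}_j = \bar{u}_{j'}$.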


Intuitively, we use $f_0$ as the indicator to determine whether
a string in $\Sigma^t$ is an element in the model. To ensure that
the model is not empty, we insist that $0^t$ belongs to the
model, hence the formula $\bar{u}_1 = 0^t \to f_0(\bar{u}_1)$. We use the
vector of variables $\bar{u}_i$ to represent $x_i$. For every $1 \le i \le k$,
the functions $f_{i,1}, \ldots, f_{i,t}$ represent the bit representation of
$g_i(x_{j_1}, \ldots, x_{j_m})$. Finally, for every $1 \le i \le \ell$, the function $f_{P_i}$
represents the predicate $P_i$. Note the part $\bigwedge_{i=1}^{n} f_0(\bar{u}_i) \to \Psi$,
which means we require $\Psi$ to hold only on the vectors
$\bar{u}_1, \ldots, \bar{u}_n$ that "pass" the function $f_0$, i.e., they are elements
of the model. It is routine to verify that the formula $\varphi$ in
Eq. (5) is satisfiable by a model with cardinality at most $N$
iff the ∃SOQBF formula $\Phi$ in Eq. (6) is satisfiable.

Abstract: The fundamental problem of weighted sampling
involves sampling of satisfying assignments of Boolean formulas,
which specify sampling sets, according to distributions
defined by pre-specified weight functions. The
tight integration of sampling routines in various applications has
highlighted the need for samplers to be incremental, i.e., samplers
are expected to handle updates to weight functions.

The primary contribution of this work is an efficient knowledge
compilation-based weighted sampler, INC, designed for
incremental sampling. INC builds on top of the recently proposed
knowledge compilation language, OBDD[∧], and is accompanied
by rigorous theoretical guarantees. Our extensive experiments
demonstrate that INC is faster than the state-of-the-art approach in
the majority of our evaluations. In particular, we observed a median
of 1.69× runtime improvement over the prior state-of-the-art
approach.

Index Terms: knowledge compilation, sampling, weighted
sampling


I. INTRODUCTION 


Given a Boolean formula F and weight function W, 
weighted sampling involves sampling from the set of satisfying 
assignments of F according to the distribution defined by 
W. Weighted sampling is a fundamental problem in many 
fields such as computer science, mathematics and physics, with 
numerous applications. In particular, constrained-random sim- 
ulation forms the bedrock of modern hardware and software 
verification efforts [1]. 

Sampling techniques are fundamental building blocks, and 
there has been sustained interest in the development of 
sampling tools and techniques. Recent years witnessed the 
introduction of numerous sampling tools and techniques, from 
approximate sampling techniques to uniform samplers SPUR 
and KUS, and weighted sampler WAPS [2]-[6]. Sampling 
tools and techniques have seen continuous adoption in many 
applications and settings [7]–[12]. The scalability of a sampler
is a consideration that directly affects its adoption rate. There- 
fore, improving scalability continues to be a key objective for 
the community focused on developing samplers. 

The tight integration of sampling routines in various applications
has highlighted the importance of samplers handling
incremental weight updates over multiple sampling rounds, also
known as incremental weighted sampling. Existing efforts on
improving scalability typically focus on single-round weighted
sampling and might have overlooked the incremental setting.
In particular, existing approaches involving incremental
weighted sampling typically employ off-the-shelf weighted
samplers, which could lead to less than ideal incremental
sampling performance.

The primary contribution of this work is an efficient, scalable
weighted sampler, INC, that is designed from the ground up to
address scalability issues in incremental weighted sampling
settings. The core architecture of INC is based on the knowledge
compilation (KC) paradigm, which seeks to succinctly represent
all satisfying assignments of a Boolean formula with a
directed acyclic graph (DAG) [13]. In the design of INC, we
make two core decisions that are responsible for outperforming
the current state-of-the-art weighted sampler. Firstly, we build
INC on top of PROB (Probabilistic OBDD[∧] [14]), which
is substantially smaller than the KC diagram used in the
prior state-of-the-art approaches. Secondly, INC is designed to
perform annotation, which refers to the computation of joint
probabilities, in log space to avoid the slower alternative of
using arbitrary precision math computations.

Given a Boolean formula F and weight function W, INC
compiles and stores the compiled PROB in the first round
of sampling. The weight updates for subsequent incremental
sampling rounds are processed without recompilation,
amortizing the compilation cost. Furthermore, for each sampling
round, INC simultaneously performs annotation and sampling
in a single bottom-up pass of the PROB, achieving speedup
over existing approaches. We observed that INC is significantly
faster than the existing state-of-the-art in the incremental
sampling routine. In our empirical evaluations, INC achieved
a median of 1.69× runtime improvement over the state-of-the-art
weighted sampler, WAPS [6]. Additional performance
breakdown analysis supports our design choices in the
development of INC. In particular, PROB is on median 4.64×
smaller than the KC diagram used by the competing approach,
and log-space annotation computations are on median 1.12×
faster than arbitrary precision computations. Furthermore, INC
demonstrated significantly better handling of incremental
sampling rounds, with incremental sampling rounds taking on
median 5.9% of the runtime of the initial round, compared to
67.6% for WAPS.

The rest of the paper is organized as follows. We first
introduce the relevant background knowledge and related works
in Section II. We then introduce PROB and its properties in
Section III. In Section IV, we introduce our weighted sampler
INC, detail important implementation decisions, and provide
theoretical analysis of INC. We then describe the extensive
empirical evaluations and discuss the results in Section V.
Finally, we conclude in Section VI.


II. BACKGROUND AND RELATED WORK 


Knowledge Compilation: Knowledge compilation (KC)
involves representing logical formulas as directed acyclic
graphs (DAG), which are commonly referred to as knowledge
compilation diagrams [13]. The goal of knowledge compilation
is to allow for tractable computation of certain queries
such as model counting and weighted sampling. There are
many well-studied forms of knowledge compilation diagrams
such as d-DNNF, SDD, BDD, ZDD, OBDD, AOBDD, and the
like [15]–[21]. In this work, we build our weighted sampler
upon a variant of OBDD known as OBDD[∧] [14].

OBDD[∧]: Lee [15] introduced the Binary Decision Diagram
(BDD) as a way to represent Shannon expansion [22].
Bryant [16] introduced fixed variable orderings to BDDs (known as
OBDD) for canonical representation and compression
of BDDs via shared sub-graphs. Lai et al. [14] introduced
conjunction nodes to OBDDs (known as OBDD[∧]) to
further reduce the size of the resultant DAG to represent a
given Boolean formula. In this work, we parameterize an
OBDD[∧] to form a PROB that is used for weighted sampling.

Sampling: A Boolean variable x can be assigned either
true or false, and its literal refers to either x or its negation. A
Boolean formula is in conjunctive normal form (CNF) if it is
a conjunction of clauses, with each clause being a disjunction
of literals. A Boolean formula F is satisfiable if there exists an
assignment $\tau$ of its variables such that F evaluates to true.
The model count of Boolean formula F refers to the number
of distinct satisfying assignments of F.

Weighted sampling is concerned with sampling elements from a
distribution according to non-negative weights provided by a
user-defined weight function W. In the context of this work,
weighted sampling refers to the process of sampling from
the space of satisfying assignments of a Boolean formula F.
The weight function W assigns a non-negative weight to each
literal $l$ of F. The weight of an assignment $\tau$ is defined as the
product of the weights of its literals.
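As a reference point for these definitions, the target distribution can be computed by brute-force enumeration on a small formula. The Python sketch below is illustrative only: it uses the formula $F = (x \vee y) \wedge (\neg x \vee \neg z)$ that appears in Fig. 1, while the literal weights and the function name `weighted_distribution` are assumptions. Enumeration is exponential in the number of variables, which is what knowledge compilation-based samplers are designed to avoid.

```python
from itertools import product


def weighted_distribution(variables, formula, weight):
    """Return {assignment tuple: probability} over satisfying assignments.

    weight maps (variable, truth value) to a non-negative literal weight;
    the weight of an assignment is the product of its literal weights.
    """
    dist = {}
    for values in product((False, True), repeat=len(variables)):
        tau = dict(zip(variables, values))
        if formula(tau):
            w = 1.0
            for var, val in tau.items():
                w *= weight[(var, val)]
            dist[values] = w
    total = sum(dist.values())  # normalization factor
    return {k: w / total for k, w in dist.items()}


def F(tau):
    """F = (x or y) and (not x or not z)."""
    return (tau["x"] or tau["y"]) and (not tau["x"] or not tau["z"])


# Hypothetical literal weights: x is three times as heavy as its negation.
W = {("x", True): 3.0, ("x", False): 1.0,
     ("y", True): 1.0, ("y", False): 1.0,
     ("z", True): 1.0, ("z", False): 1.0}

dist = weighted_distribution(["x", "y", "z"], F, W)
for assignment, p in sorted(dist.items()):
    print(assignment, p)
```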

WAPS: KUS [5] utilizes knowledge compilation tech- 
niques, specifically Deterministic Decomposable Negation 
Normal Form (d-DNNF) [19], to perform uniform sampling 
in 2 passes of the d-DNNF. Annotation is performed in the 
first pass, followed by sampling. WAPS [6] improves upon 
KUS by enabling weighted sampling via parameterization of 
the d-DNNF. WAPS performs sampling in a similar manner 
to KUS, the main difference being that the annotation step 
in WAPS takes into account the provided weight function. In 
contrast, we introduce INC which performs weighted sampling 
in a single pass by leveraging the DAG structure of PROB. 

Knowledge compilation-based samplers typically perform 
incremental sampling as follows. The sampling space is first 
expressed as satisfying assignments of a Boolean formula, 
which is then compiled into the respective knowledge compila- 
tion form. In the following step, samples are drawn according 
to the given weight function W. Subsequently, the weights 


are updated depending on application logic and weighted 
sampling is performed again. The process is repeated until 
an application-specific stopping criterion is met. An example 
of such an application would be the Baital framework [10], 
developed to use incremental weighted sampling to generate 
test cases for configurable systems. 


III. PROB: PROBABILISTIC OBDD[∧]


PROB is a DAG composed of four types of nodes - 
conjunction, decision, true and false nodes. The internal nodes 
of a PROB consist of conjunction and decision nodes whereas 
the leaf nodes of the PROB consist of true and false nodes. 
A PROB is recursively made up of sub-PROBs that represent 
sub-formulas of Boolean formula F. We use VarSet(n) to 
refer to the set of variables of F represented by a PROB with 
n as the root node. Subdiagram(n) refers to the sub-PROB 
starting at node n and Parent(n) refers to the immediate parent
of node n in PROB. 


A. PROB Structure 


Conjunction node (∧-node): A ∧-node $n_c$ represents
conjunctions in the assignment space. There is no limit on
the number of child nodes that $n_c$ can have. However, the
sets of variables (VarSet(·)) of the child nodes of $n_c$ must be
pairwise disjoint. An example of a ∧-node is n2 in Figure 1.
Notice that VarSet(n4) = {z} and VarSet(n5) = {y} are
disjoint.

Decision node: A decision node $n_d$ represents decisions
on the associated Boolean variable Var($n_d$) in the Boolean
formula F that the PROB represents. A decision node has
exactly two children: lo-child (Lo($n_d$)) and hi-child (Hi($n_d$)).
Lo($n_d$) represents the assignment space when Var($n_d$) is set to
false and Hi($n_d$) represents otherwise. $\theta_{Hi(n_d)}$ and $\theta_{Lo(n_d)}$ refer to
the parameters associated with the edges connecting decision
node $n_d$ with Hi($n_d$) and Lo($n_d$) respectively in a PROB.
Node n1 in Figure 1 is a decision node with Var(n1) = x,
Hi(n1) = n3 and Lo(n1) = n2.

True and False nodes: True ($\top$) and false ($\bot$) nodes are
leaf nodes in a PROB. Let $\tau$ be an assignment of all variables
of Boolean formula F and let PROB $\psi$ represent F. $\tau$
corresponds to a traversal of $\psi$ from the root node to leaf nodes. The
traversal follows $\tau$ at every decision node and visits all child
nodes of every conjunction node encountered along the way. $\tau$
is a satisfying assignment if all parts of the traversal eventually
lead to the true node. $\tau$ is not a satisfying assignment if any
part of the traversal leads to the false node. With reference to
Figure 1, let $\tau_1 = \{x, y, \neg z\}$ and $\tau_2 = \{x, y, z\}$. For $\tau_1$, the
traversal would visit n1, n3, n6, n7, n9, and $\tau_1$ is a satisfying
assignment since the traversal always leads to the $\top$ node (n9).
As a counter-example, $\tau_2$ is not a satisfying assignment, with
its corresponding traversal visiting n1, n3, n6, n7, n8, n9. The $\tau_2$
traversal visits the $\bot$ node (n8) because variable z ↦ true in $\tau_2$
and Hi(n6) is node n8.


B. PROB Parameters 


In the PROB structure, each decision node $n_d$ has two
parameters $\theta_{Lo(n_d)}$ and $\theta_{Hi(n_d)}$, associated with the two branches
of $n_d$, which sum up to 1. $\theta_{Lo(n_d)}$ is the normalized weight
of the literal $\neg$Var($n_d$) and similarly, $\theta_{Hi(n_d)}$ is that of the
literal Var($n_d$). One can view $\theta_{Lo(n_d)}$ to be the probability of
picking $\neg$Var($n_d$) and $\theta_{Hi(n_d)}$ to be that of picking Var($n_d$) by
the determinism property introduced later. Let $x_i$ be Var($n_d$).
Given a weight function W:

$$\theta_{Lo(n_d)} = \frac{W(\neg x_i)}{W(\neg x_i) + W(x_i)}, \qquad \theta_{Hi(n_d)} = \frac{W(x_i)}{W(\neg x_i) + W(x_i)}$$

Fig. 1: A smooth PROB $\psi_1$ with 9 nodes, n1,...,n9,
representing $F = (x \vee y) \wedge (\neg x \vee \neg z)$. Branch parameters are
omitted.


C. PROB Properties


The PROB structure has important properties such as
determinism and decomposability. In addition to the determinism
and decomposability properties, we ensure that PROBs used in
this work have the smoothness property through a smoothing
process (Algorithm 1).


Property 1 (Determinism). For every decision node $n_d$, the sets
of satisfying assignments represented by Hi($n_d$) and Lo($n_d$)
are logically disjoint.


Property 2 (Decomposability). For every conjunction node
$n_c$, VarSet($c_i$) $\cap$ VarSet($c_j$) $= \emptyset$ for all $c_i$ and $c_j$ where $c_i, c_j \in$
Child($n_c$) and $c_i \neq c_j$.


Property 3 (Smoothness). For every decision node $n_d$,
VarSet(Hi($n_d$)) = VarSet(Lo($n_d$)).


D. Joint Probability Calculation with PROB 


In Section III-B, we mention that one can view the branch
parameters as the probability of choosing between the positive
and negative literal of a decision node. Notice that because of
the decomposability and determinism properties of PROB, it
is straightforward to calculate the joint probabilities at every
given node. At each conjunction node $n_c$, since the variable
sets of the child nodes of $n_c$ are disjoint by decomposability,
the joint probability of $n_c$ is simply the product of joint
probabilities of each child node. At each decision node $n_d$,
there are only two possible outcomes on Var($n_d$): the positive
literal Var($n_d$) or the negative literal $\neg$Var($n_d$). By the determinism
property, the joint probability is the sum of the two possible
scenarios. Formally, the calculations for joint probabilities $P'$
at each node in PROB are as follows:

$$P'(n_c) = \prod_{c \in Child(n_c)} P'(c) \qquad \text{(EQ1)}$$

$$P'(n_d) = \theta_{Lo(n_d)} \times P'(Lo(n_d)) + \theta_{Hi(n_d)} \times P'(Hi(n_d)) \qquad \text{(EQ2)}$$

For a true node n, $P'(n) = 1$ because it represents satisfying
assignments when reached. In contrast, $P'(n) = 0$ when n
is a false node as it represents non-satisfying assignments. In
Proposition 2, we show that weighted sampling is equivalent
to sampling according to joint probabilities of satisfying
assignments of a PROB.
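A minimal Python sketch of EQ1 and EQ2 on the PROB of Fig. 1 follows. The `Node` class, the `build_fig1` helper, and the branch parameter values are assumptions for illustration; the node structure is reconstructed from the description of Fig. 1 in the text (in particular, the children of n5 are inferred from the formula), and the code is not the authors' implementation.

```python
from dataclasses import dataclass, field
from math import prod


@dataclass(eq=False)
class Node:
    kind: str                     # "true", "false", "and" or "decision"
    var: str = ""                 # variable of a decision node
    children: list = field(default_factory=list)  # children of an "and" node
    lo: object = None             # lo-child of a decision node
    hi: object = None             # hi-child of a decision node
    theta_lo: float = 0.5         # branch parameter of the lo edge
    theta_hi: float = 0.5         # branch parameter of the hi edge


def build_fig1(theta_hi):
    """Build the smooth PROB of Fig. 1; theta_hi maps variable -> theta_Hi."""
    def dec(var, lo, hi):
        t = theta_hi[var]
        return Node("decision", var, lo=lo, hi=hi, theta_lo=1 - t, theta_hi=t)

    n8, n9 = Node("false"), Node("true")
    n4 = dec("z", n9, n9)          # z is unconstrained when x is false
    n5 = dec("y", n8, n9)          # y must be true when x is false
    n6 = dec("z", n9, n8)          # z must be false when x is true
    n7 = dec("y", n9, n9)          # y is unconstrained when x is true
    n2 = Node("and", children=[n4, n5])
    n3 = Node("and", children=[n6, n7])
    return dec("x", n2, n3)        # n1


def joint_prob(n, memo=None):
    """Joint probability P' of node n, memoized so each node is computed once."""
    memo = {} if memo is None else memo
    if id(n) in memo:
        return memo[id(n)]
    if n.kind == "true":
        p = 1.0
    elif n.kind == "false":
        p = 0.0
    elif n.kind == "and":          # EQ1
        p = prod(joint_prob(c, memo) for c in n.children)
    else:                          # EQ2
        p = (n.theta_lo * joint_prob(n.lo, memo)
             + n.theta_hi * joint_prob(n.hi, memo))
    memo[id(n)] = p
    return p


uniform = build_fig1({"x": 0.5, "y": 0.5, "z": 0.5})
print(joint_prob(uniform))  # 4 of the 8 assignments satisfy F, so 0.5
```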


IV. INC - SAMPLING FROM PROB 


In this section, we introduce INC - a bottom-up algorithm 
for weighted sampling on PROB. We first describe INC for 
drawing one sample and subsequently describe how to extend 
INC to draw k samples at once. We also provide proof of 
correctness that INC is indeed performing weighted sampling. 
As a side note, samples are drawn with replacement, in line 
with the existing state-of-the-art weighted sampler [6]. 


A. Preprocessing PROB 


In the main sampling algorithm (Algorithm 2) to be
introduced later in this section, the input is a smooth PROB. As a
preprocessing step, we introduce the Smooth algorithm that takes
in a PROB $\psi$ and performs smoothing.

The Smooth algorithm processes the nodes in the input
PROB $\psi$ in a bottom-up manner while keeping track of
VarSet(n) for every node n in $\psi$ using a map $\kappa$. True and
false nodes have $\emptyset$ as they are leaf nodes and do not represent
any variables. At each conjunction node, its variable set is the
union of variable sets of its child nodes.

The smoothing happens at a decision node n in $\psi$ when
VarSet(Lo(n)) and VarSet(Hi(n)) do not contain the same
set of variables, as shown by lines 8 and 16 of Algorithm 1.
In the smoothing process, a new conjunction node (lcNode
for Lo(n) and rcNode for Hi(n)) is created to replace the
corresponding child of n, with the original child node now
set as a child of the conjunction node. Additionally, for each
of the missing variables v, a decision node representing v is
created and added as a child of the respective conjunction
node. The decision nodes created during smoothing have both
their lo-child and hi-child set to the true node. To reduce
memory footprint, we check if the same decision node already
exists before creating it in the checkMakeTrueDecisionNode
function.

As an example, we refer to $\psi_2$ in Figure 2. It is obvious
that $\psi_2$ is not smooth, because VarSet(Lo(n1)) = {y} and
VarSet(Hi(n1)) = {z}. In the smoothing process, we replace
Lo(n1) with a new conjunction node n2 and add a decision
node n4 representing the missing variable z, with both children set
to the true node n9. We repeat the steps for Hi(n1) to arrive at
PROB $\psi_1$ in Figure 1.

Algorithm 1 Smooth - returns a smoothed PROB

Input: PROB ψ
Output: smooth PROB

 1: κ ← initMap()
 2: for node n of ψ in bottom-up order do
 3:     if n is true node or false node then
 4:         κ[n] ← ∅
 5:     else if n is ∧-node then
 6:         κ[n] ← unionVarSet(Child(n), κ)
 7:     else
 8:         if κ[Hi(n)] − κ[Lo(n)] ≠ ∅ then
 9:             lset ← κ[Hi(n)] − κ[Lo(n)]
10:             lcNode ← new ∧-node()
11:             lcNode.addChild(Lo(n))
12:             for var v in lset do
13:                 dNode ← checkMakeTrueDecisionNode(v)
14:                 lcNode.addChild(dNode)
15:             Lo(n) ← lcNode
16:         if κ[Lo(n)] − κ[Hi(n)] ≠ ∅ then
17:             rset ← κ[Lo(n)] − κ[Hi(n)]
18:             rcNode ← new ∧-node()
19:             rcNode.addChild(Hi(n))
20:             for var v in rset do
21:                 dNode ← checkMakeTrueDecisionNode(v)
22:                 rcNode.addChild(dNode)
23:             Hi(n) ← rcNode
24:         κ[n] ← Var(n) ∪ unionVarSet({Hi(n), Lo(n)})
25: return ψ


Fig. 2: A PROB $\psi_2$ representing Boolean formula $F = (x \vee y) \wedge (\neg x \vee \neg z)$; branch parameters are omitted
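The smoothing step can be sketched in Python as follows, applied to the PROB of Fig. 2. This is an illustrative sketch under assumptions, not the authors' code: the `Node` class and helper names are hypothetical, the traversal is a memoized recursion rather than an explicit bottom-up loop, and both set differences are computed from the variable sets of the original children before either child is replaced.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    kind: str                     # "true", "false", "and" or "decision"
    var: str = ""
    children: list = field(default_factory=list)
    lo: object = None
    hi: object = None


def wrap(child, missing, true_node, fillers):
    """Replace child by an and-node that also mentions the missing variables."""
    and_node = Node("and", children=[child])
    for v in sorted(missing):
        # Reuse an existing filler node, as checkMakeTrueDecisionNode does.
        if v not in fillers:
            fillers[v] = Node("decision", v, lo=true_node, hi=true_node)
        and_node.children.append(fillers[v])
    return and_node


def smooth(n, kappa, true_node, fillers):
    """Smooth the sub-PROB rooted at n in place and return VarSet(n)."""
    if id(n) in kappa:
        return kappa[id(n)]
    if n.kind in ("true", "false"):
        vs = frozenset()
    elif n.kind == "and":
        vs = frozenset().union(
            *(smooth(c, kappa, true_node, fillers) for c in n.children))
    else:
        lo_vs = smooth(n.lo, kappa, true_node, fillers)
        hi_vs = smooth(n.hi, kappa, true_node, fillers)
        if hi_vs - lo_vs:
            n.lo = wrap(n.lo, hi_vs - lo_vs, true_node, fillers)
        if lo_vs - hi_vs:
            n.hi = wrap(n.hi, lo_vs - hi_vs, true_node, fillers)
        vs = frozenset({n.var}) | lo_vs | hi_vs
    kappa[id(n)] = vs
    return vs


def var_set(n):
    """Recompute VarSet(n) from scratch (used only to check the result)."""
    if n.kind in ("true", "false"):
        return frozenset()
    if n.kind == "and":
        return frozenset().union(*(var_set(c) for c in n.children))
    return frozenset({n.var}) | var_set(n.lo) | var_set(n.hi)


# PROB of Fig. 2 for F = (x or y) and (not x or not z), before smoothing.
false_node, true_node = Node("false"), Node("true")
root = Node("decision", "x",
            lo=Node("decision", "y", lo=false_node, hi=true_node),
            hi=Node("decision", "z", lo=true_node, hi=false_node))

smooth(root, {}, true_node, {})
print(sorted(var_set(root.lo)), sorted(var_set(root.hi)))
```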


B. Sampling Algorithm 


INC takes a PROB $\psi$ representing Boolean formula F and
draws a sample from the space of satisfying assignments of F;
the process is illustrated by Algorithm 2. INC performs
sampling in a bottom-up manner while integrating the annotation
process in the same bottom-up pass. Since we want to sample
from the space of satisfying assignments, we can ignore false
nodes in $\psi$ entirely by considering a sub-DAG that excludes
false nodes and edges leading to them, as shown by line 3. As
an example, hideFalseNode when applied to $\psi_1$ would remove
node n8 and the edges immediately leading to it. Next, INC
processes each of the remaining nodes in bottom-up order
while keeping two caches: $\omega$ to store the partial samples
from each node, and $\phi$ to store the joint probability at each node.


Algorithm 2 INC - returns a satisfying assignment based on
PROB ψ parameters

Input: smooth PROB ψ
Output: a sampled satisfying assignment

 1: cache ω ← initCache()
 2: joint prob cache φ ← initCache()
 3: ψ′ ← hideFalseNode(ψ)
 4: for node n of ψ′ in bottom-up order do
 5:     if n is true node then
 6:         ω[n] ← ∅
 7:         φ[n] ← 1
 8:     else if n is ∧-node then
 9:         ω[n] ← unionChild(Child(n), ω)
10:         φ[n] ← ∏_{c ∈ Child(n)} φ[c]
11:     else
12:         p_lo ← θ_Lo(n) × φ[Lo(n)]
13:         p_hi ← θ_Hi(n) × φ[Hi(n)]
14:         p_joint ← p_lo + p_hi
15:         φ[n] ← p_joint
16:         r ← x ∼ binomial(1, p_hi / p_joint)
17:         if r is 1 then
18:             ω[n] ← ω[Hi(n)] ∪ Var(n)
19:         else
20:             ω[n] ← ω[Lo(n)] ∪ ¬Var(n)
21: return ω[rootnode(ψ)]


INC starts with $\emptyset$ at the true node since there is no associated
variable.

At each conjunction node, INC takes the union of the samples
of the child nodes in line 9. Using n2 in Figure 1 as an example, if the
sample drawn at n4 is $\omega$[n4] = {$\neg z$} and at n5 is $\omega$[n5] = {y},
then unionChild(Child(n2), $\omega$) = {y, $\neg z$}. At each decision
node n, a decision on Var(n) is sampled in lines 16
to 20. We first calculate the joint probabilities, $p_{lo}$ and $p_{hi}$,
of choosing $\neg$Var(n) and choosing Var(n). Subsequently,
we sample a decision on Var(n) using a binomial distribution
in line 16 with the probability of success being the normalized joint
probability of choosing Var(n). After processing all nodes,
the sampled assignment is the output at the root node of $\psi$.
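The single-pass procedure can be sketched in Python as follows. This is a hedged illustration rather than the authors' implementation: the `Node` class, `build_fig1`, and `inc_sample` are hypothetical names, the bottom-up order is realized by a memoized post-order recursion, and instead of removing false nodes the sketch assigns them joint probability 0, so a branch leading to a false node is never chosen.

```python
import random
from dataclasses import dataclass, field
from math import prod


@dataclass(eq=False)
class Node:
    kind: str                     # "true", "false", "and" or "decision"
    var: str = ""
    children: list = field(default_factory=list)
    lo: object = None
    hi: object = None
    theta_lo: float = 0.5
    theta_hi: float = 0.5


def build_fig1(theta_hi):
    """Smooth PROB of Fig. 1 (structure reconstructed from the text)."""
    def dec(var, lo, hi):
        t = theta_hi[var]
        return Node("decision", var, lo=lo, hi=hi, theta_lo=1 - t, theta_hi=t)

    n8, n9 = Node("false"), Node("true")
    n2 = Node("and", children=[dec("z", n9, n9), dec("y", n8, n9)])
    n3 = Node("and", children=[dec("z", n9, n8), dec("y", n9, n9)])
    return dec("x", n2, n3)


def inc_sample(root, rng):
    """Draw one satisfying assignment; annotation and sampling share one pass."""
    omega, phi = {}, {}           # partial samples and joint probabilities

    def visit(n):
        key = id(n)
        if key in phi:            # each node is processed exactly once
            return
        if n.kind == "true":
            omega[key], phi[key] = frozenset(), 1.0
        elif n.kind == "false":
            omega[key], phi[key] = frozenset(), 0.0
        elif n.kind == "and":
            for c in n.children:
                visit(c)
            omega[key] = frozenset().union(*(omega[id(c)] for c in n.children))
            phi[key] = prod(phi[id(c)] for c in n.children)
        else:
            visit(n.lo)
            visit(n.hi)
            p_lo = n.theta_lo * phi[id(n.lo)]
            p_hi = n.theta_hi * phi[id(n.hi)]
            p_joint = p_lo + p_hi
            phi[key] = p_joint
            # Bernoulli draw with success probability p_hi / p_joint.
            if p_joint > 0 and rng.random() < p_hi / p_joint:
                omega[key] = omega[id(n.hi)] | {(n.var, True)}
            else:
                omega[key] = omega[id(n.lo)] | {(n.var, False)}

    visit(root)
    return dict(omega[id(root)])


rng = random.Random(0)
root = build_fig1({"x": 0.5, "y": 0.5, "z": 0.5})
print(inc_sample(root, rng))
```

Because the random draw at a decision node only needs the joint probabilities of its two children, it can be made as soon as those children have been processed, which is what allows annotation and sampling to share one pass.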

Extending INC to k samples: It is straightforward to
extend the single-sample INC shown in Algorithm 2 to draw
k samples in a single pass, where k is a user-specified number.
At each node, we have to store a list of k independent copies of
partial assignments drawn in $\omega$. At each conjunction node $n_c$,
we perform the same union process in line 9 of Algorithm 2
for child outputs at the same indices of the respective lists
in $\omega$. More specifically, if $n_c$ has child nodes $c_x$ and $c_y$, the
outputs at index i are combined to get the output of $n_c$ at index
i. This process is performed for all indices from 1 to k. At
each decision node $n_d$, we now draw k independent samples
instead of a single sample from the binomial distribution
shown in line 16. The sampling steps in lines 16 to 20 are
performed independently for the k random numbers. There is
no change necessary for the calculation of joint probabilities
in Algorithm 2 as there is no change in literal weights.

Incremental sampling: Given a Boolean formula F
and weight function W, INC performs incremental sampling
with the sampling process shown in Figure 3. In the initial
round, INC compiles F and W into a PROB $\psi$ and performs
sampling. Subsequent rounds involve applying a new set of
weights W to $\psi$, typically generated based on existing samples
by the controller [10], and performing weighted sampling
according to the updated weights. The number of sampling
rounds is determined by the controller component, whose logic
varies according to the application.


[Figure: flowchart with the elements CNF F, Weights W, "Is F compiled?", Weighted Sampling, Samples, Controller, and Update W.]

Fig. 3: INC's incremental sampling flow


C. Implementation Decisions 


Log-Space Calculations: INC performs the annotation
process, i.e., the computation of joint probabilities, in log space. This
design choice is made to avoid the usage of arbitrary precision
math libraries, which WAPS utilized to prevent numerical
underflow after many successive multiplications of probability
values. Using the LogSumExp trick below, it is possible to
avoid numerical underflow.

$$\log(a + b) = \log(a) + \log\left(1 + \frac{b}{a}\right) = \log(a) + \log\big(1 + \exp(\log(b) - \log(a))\big)$$

The joint probability at a decision node $n_d$ is given
by $\theta_{Lo(n_d)} \times$ (joint probability of Lo($n_d$)) $+ \; \theta_{Hi(n_d)} \times$
(joint probability of Hi($n_d$)). Notice that if we were to
perform the calculation in log space, we would have to add
the two weighted log joint probabilities, termed $p_{lo}$ and
$p_{hi}$ in Algorithm 2. Using the LogSumExp trick, we do
not need to exponentiate $p_{lo}$ and $p_{hi}$ independently, which
risks running into numerical underflow. Instead, we only
need to exponentiate the difference of $p_{lo}$ and $p_{hi}$, which is
more numerically stable. Equations EQ1 and EQ2 can be
implemented in log space as follows:

$$Q(n_c) = \sum_{c \in Child(n_c)} Q(c)$$

$$Q(n_d) = \mathrm{LogSumExp}\big[\, \log(\theta_{Lo(n_d)}) + Q(Lo(n_d)),\; \log(\theta_{Hi(n_d)}) + Q(Hi(n_d)) \,\big]$$

In the equations above, Q refers to the corresponding log joint
probabilities in EQ1 and EQ2. In the experiments section,
we detail the runtime advantages of using log computations
compared to arbitrary precision math computations.
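The following Python sketch illustrates the LogSumExp identity and the log-space form of EQ2. The function names `log_sum_exp` and `q_decision` are assumptions for illustration. The sketch subtracts the larger of the two log values, a common variant of the identity above that keeps the exponent non-positive, and treats a log probability of negative infinity (probability 0) as a special case.

```python
import math


def log_sum_exp(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), without forming a or b."""
    if log_a == -math.inf:
        return log_b
    if log_b == -math.inf:
        return log_a
    hi, lo = max(log_a, log_b), min(log_a, log_b)
    # Only the difference lo - hi <= 0 is exponentiated.
    return hi + math.log1p(math.exp(lo - hi))


def q_decision(theta_lo, q_lo, theta_hi, q_hi):
    """Log-space EQ2: log joint probability at a decision node."""
    return log_sum_exp(math.log(theta_lo) + q_lo, math.log(theta_hi) + q_hi)


# Direct exponentiation underflows, while the log-space sum stays finite.
tiny = -2000.0
print(math.exp(tiny))            # underflows to 0.0
print(log_sum_exp(tiny, tiny))   # equals -2000 + log(2)
```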

Dynamic Annotation: In the existing state-of-the-art
weighted sampler WAPS, sampling is performed in two
passes: the first pass performs annotation and the second
pass samples assignments according to the joint probabilities.
In INC, we combine the two passes into a single bottom-up
pass, performing annotation dynamically while sampling at
each node.


D. Theoretical Analysis 


Proposition 1. Branch parameters of any decision node $n_d$
are correct sampling probabilities, i.e., $W(x_i) : W(\neg x_i) =
\theta_{Hi(x_i)} : \theta_{Lo(x_i)}$ where Var($n_d$) $= x_i$.


Proof.

$$\frac{W(x_i)}{W(\neg x_i)} = \frac{\dfrac{W(x_i)}{W(x_i) + W(\neg x_i)}}{\dfrac{W(\neg x_i)}{W(x_i) + W(\neg x_i)}} = \frac{\theta_{Hi(x_i)}}{\theta_{Lo(x_i)}}$$

We start with the ratio of literal weights of $x_i$, divide both
numerator and denominator by $W(x_i) + W(\neg x_i)$ and arrive at
the ratio of branch parameters of $n_d$. Notice that only the ratio
matters for sampling correctness and not the absolute value of
weights.


Remark 1. Let $n_d$ be an arbitrary decision node in PROB
$\psi$. When performing sampling according to a weight function
W, $\theta_{Lo(n_d)}$ is the probability of picking $\neg$Var($n_d$) and $\theta_{Hi(n_d)}$
is that of Var($n_d$). The determinism property states that the
choice of either literal is disjoint at each decision node.


Proposition 2. INC samples an assignment $\tau$ from PROB $\psi$
with probability $\frac{1}{N} \prod_{l \in \tau} W(l)$, where N is a normalization
factor.


Proof. The proof consists of two parts, one for ∧-nodes and
another for decision nodes.

∧-node: Let $n_c$ be an arbitrary conjunction node in
PROB $\psi$. Recall that by the decomposability property, $\forall c_i, c_j \in$
Child($n_c$) with $c_i \neq c_j$, VarSet($c_i$) $\cap$ VarSet($c_j$) $= \emptyset$. As
such, an arbitrary variable $x_i \in$ VarSet($n_c$) only belongs to
the variable set of one child node $c_i \in$ Child($n_c$). Therefore,
the assignment of $x_i$ can be sampled independently of $x_j$ where
$x_j \in$ VarSet($c_j$), $\forall c_j \neq c_i$. Let $\tau'_{c_i}$ be the partial assignment
for child node $c_i \in$ Child($n_c$). Notice that each partial
assignment $\tau'_{c_i}$ is sampled independently of the others as there
are no overlapping variables, hence their joint probability is
simply the product of their individual probabilities. This agrees
with the weight of an assignment being the product of its
components, up to a normalization factor.

Decision node: Let $n_d$ be an arbitrary decision node in
PROB $\psi$ and $x_d$ be $Var(n_d)$. At $n_d$, we sample an assignment
of $x_d$ based on the parameters $\theta_{Lo(x_d)}$ and $\theta_{Hi(x_d)}$, which
are probabilities of literal assignment by Proposition 1. By
Proposition 1, one can see that the assignment of $x_d$ is sampled
correctly according to W. As the sampling process at $n_d$ is
independent of its child nodes by the determinism property,
the joint probability of the sampled assignment of $x_d$ and the
output partial assignment from the corresponding child node
would be the product of their probabilities. Notice that the
joint probability aligns with the definition of the weight of an
assignment being the product of the weight of its literals, up
to a normalization factor.

Since we do not consider the false node and treat it
as having 0 probability, we always sample from satisfying
assignments by starting at the true node in bottom-up ordering.
Reconciling the sampling process at the two types of nodes,
any combination of decision and ∧-nodes
encountered in the sampling process agrees with a given
weight function W up to a normalization factor 1/N. In
fact, $N = \sum_{\tau_i \in S} W(\tau_i)$, where S is the set of satisfying
assignments of the Boolean formula F that $\psi$ represents. As
mentioned in the proof of Proposition 1, normalization factors do
not affect the correctness of sampling according to W, and
we have shown that INC performs weighted sampling correctly
under multiplicative weight functions.
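
To make the target distribution concrete, the following Python sketch enumerates a tiny formula by brute force and normalizes the multiplicative weights. It is not how INC samples; it only illustrates the distribution a correct weighted sampler should follow. The formula (x OR y) and the weights 0.75 / 0.25 are illustrative choices of ours.

```python
from itertools import product

def target_distribution(variables, formula, weight):
    """Map each satisfying assignment to W(assignment) / N.

    `weight` maps (variable, value) pairs to literal weights and
    `formula` is a predicate over a dict assignment.
    """
    scores = {}
    for values in product([False, True], repeat=len(variables)):
        tau = dict(zip(variables, values))
        if formula(tau):
            w = 1.0
            for literal in tau.items():
                w *= weight[literal]       # multiplicative weight function
            scores[values] = w
    n = sum(scores.values())               # normalization factor N
    return {tau: w / n for tau, w in scores.items()}

variables = ["x", "y"]
weight = {("x", True): 0.75, ("x", False): 0.25,
          ("y", True): 0.75, ("y", False): 0.25}
dist = target_distribution(variables, lambda t: t["x"] or t["y"], weight)
```

Here N = 0.9375, and the three satisfying assignments receive probabilities 0.2, 0.2 and 0.6.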


Remark 2. From the proof of Proposition 2, the determinism
and decomposability properties are important to ensure the
correctness of INC. The smoothness property is important
to ensure that the assignment sampled by INC is complete.
For formula $F = (x \lor y) \land (\neg x \lor \neg z)$, an assignment $\tau_1$
sampled from a non-smooth PROB could be $\{x, \neg z\}$. Notice
that $\tau_1$ is missing an assignment for variable y. By performing
smoothing, we will be able to sample a complete assignment
of all variables in the Boolean formula, as both child nodes of
each decision node n have the same VarSet(·).
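
The incompleteness in this example can be checked by brute force. The sketch below (helper names are ours) lists every complete satisfying assignment that extends the partial assignment x = true, z = false; there are two, one for each value of y, so the partial assignment alone does not identify a single sample.

```python
from itertools import product

def formula(x, y, z):
    """F = (x OR y) AND (NOT x OR NOT z)."""
    return (x or y) and ((not x) or (not z))

def completions(partial, variables=("x", "y", "z")):
    """All satisfying complete assignments that extend `partial`."""
    free = [v for v in variables if v not in partial]
    result = []
    for values in product([False, True], repeat=len(free)):
        tau = dict(partial, **dict(zip(free, values)))
        if formula(**tau):
            result.append(tau)
    return result

partial = {"x": True, "z": False}   # the partial assignment from the remark
full = completions(partial)
```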


V. EXPERIMENTS 


We implement INC in Python 3.7.10, using NumPy 1.15 and the
Toposort package. In our experiments, we make use of an
off-the-shelf KC diagram compiler, KCBox [23]. In the later parts
of this section, we perform additional comparisons against
an implementation of INC using the Gmpy2 arbitrary precision
math package (INCap) to determine the impact of log-space
annotation computations.

Our benchmark suite consists of instances arising from 
a wide range of real-world applications such as DQMR 
networks, bit-blasted versions of SMT-LIB (SMT) bench- 
marks, ISCAS89 circuits, and configurable systems [6], [10]. 
For incremental updates, we rely on the weight generation 
mechanism proposed in the context of prior applications of 
incremental sampling [10]. In particular, new weights are 
generated based on the samples from the previous rounds, 
resulting in the need to recompute joint probabilities in each 
round. Keeping in line with prior work, we perform 10 rounds
(R1-R10) of incremental weighted sampling, with 100 samples
drawn in each round. The experiments were conducted with a
timeout of 3600 seconds on clusters with Intel Xeon Platinum 
8272CL processors. 

In this section, we detail the extensive experiments conducted
to understand INC's runtime behavior and to compare
it with the existing state-of-the-art weighted sampler WAPS [6]
in incremental weighted sampling tasks. We chose WAPS as
it has been shown to achieve significant runtime improvement
over other samplers, and accordingly has emerged as the sampler
of choice for practical applications [10]. In particular, our
empirical evaluation sought to answer the following questions:

RQ 1 How does INC's incremental weighted sampling runtime
performance compare to the current state-of-the-art?

RQ 2 How does using PROB affect runtime performance?

RQ 3 How do log-space calculations impact runtime
performance?

RQ 4 Does INC correctly perform weighted sampling?

RQ 1: Incremental Sampling Performance: The scatter
plot of incremental sampling runtime comparison is shown in
Figure 4, with Figure 4a showing the runtime comparison for the
first round (R1) and Figure 4b showing the runtime comparison
over 10 rounds. The vertical axes represent the runtime of
INC and the horizontal axes represent that of WAPS. In
the experiments, INC completed 650 out of 896 benchmarks
whereas WAPS completed 674. INC completed 21 benchmarks
on which WAPS timed out and, similarly, WAPS completed 45
benchmarks on which INC timed out. In the experiments, INC
achieved a median speedup of 1.69x over WAPS.

Further results are shown in Table I. Observe that for the
runtime of R1 (column 3), WAPS is faster and takes
around 0.44x of INC's runtime in the median case. However,
INC takes the lead in runtime performance when we examine
the total time taken for the incremental rounds R2 to R10
(column 4). For incremental rounds, WAPS always took longer
than INC; in the median case, WAPS took 4.48x longer than
INC. We compare the average incremental round runtime with
the first round runtime for both samplers in columns 1 and
2. In the median case, an incremental round for WAPS takes
67% of the time for R1 whereas an incremental round for
INC only requires 5.9% of the time R1 takes. We show the
per-round runtime for 5 benchmarks in Table II to further
illustrate INC's runtime advantage over WAPS for incremental
sampling rounds, even though both tools reuse the respective
KC diagram compiled in R1. This set of results highlights
INC's superior performance over WAPS in the handling of
incremental sampling settings. INC's advantage in incremental
sampling rounds led to better overall runtime performance than
WAPS in 75% of evaluations. The runtime advantage of INC
would be more obvious in applications requiring more than
10 rounds of samples.

Therefore, we conducted sampling experiments for 20
rounds to substantiate our claim that INC will have a larger
runtime lead over WAPS with more rounds. Both samplers are
given the same 3600s timeout as before and draw 100
samples per round, for 20 rounds. The number of completed
benchmarks is shown in Table III. In the 20 sampling round
setting, INC completed 649 out of 896 benchmarks, timing
out on 1 additional benchmark compared to the 10 sampling
round setting. In comparison, WAPS completed 596 of 896
benchmarks, timing out on 78 more benchmarks than in
the 10 sampling round setting. In addition, WAPS takes a
median of 2.17x longer than INC under the 20 sampling round
setting, an increase over the 1.69x under the 10 sampling
round setting.

Fig. 4: Runtime comparisons between INC and state-of-the-art weighted sampler WAPS


Statistic | WAPS MEAN(R2 to R10) / WAPS R1 | INC MEAN(R2 to R10) / INC R1 | WAPS R1 / INC R1 | WAPS SUM(R2 to R10) / INC SUM(R2 to R10) | WAPS Total / INC Total
Mean | 0.74 | 0.064 | 1.03 | 15.66 | 6.12
Std | 0.24 | 0.040 | 1.47 | 26.42 | 10.73
Median | 0.67 | 0.059 | 0.44 | 4.48 | 1.69
Max | 1.25 | 0.188 | 10.65 | 172.66 | 73.96


TABLE I: Incremental weighted sampling runtime ratio statistics for WAPS and INC (numerators and denominators refer to
the corresponding runtimes)

The runtime results clearly highlight the advantage of INC 
for incremental weighted sampling applications and that INC 
is noticeably better at incremental sampling than the current 
state-of-the-art. 

RQ 2: PROB Performance Impacts: We now focus on
the analysis of the impact of using PROB compared to
d-DNNF in the design of a weighted sampler. We analyzed
the size of both PROB and d-DNNF across the benchmarks
that both tools managed to compile and show the results
in Table IV. From Table IV, PROB is always smaller than
the corresponding d-DNNF. Additionally, PROB is at median
4.64x smaller than the corresponding d-DNNF, and
PROB is an order of magnitude smaller for at least 25%
of the benchmarks. As such, PROB emerges as the clear
choice of knowledge compilation diagram used in INC, owing
to its succinctness, which leads to fast incremental sampling
runtimes.


RQ 3: Log-space Computation Performance Impacts:
In the design of INC, we utilized log-space computations
to perform annotation computations as opposed to naively
using arbitrary precision math libraries. In order to analyze the
impact of this design choice, we implemented a version of INC
where the dynamic annotation computations are performed
using arbitrary precision math in a similar manner as WAPS.
We refer to the arbitrary precision math version of INC as
INCap. As an ablation study, we compare the runtime of
both implementations across all the benchmarks and show the
comparison in Table V. The statistics shown are for the ratio
of INCap runtime to INC runtime; a value of 1.12 means that
INCap takes 1.12x the runtime of INC for the corresponding statistic.


The results in Table V highlight the runtime advantages
of our decision to use log-space computations over arbitrary
precision computations. INC has a faster runtime than INCap
in the majority of the benchmarks. INC displayed a minimum of
0.70x, a median of 1.12x, and a max of 1.89x speedup over
INCap. Furthermore, INCap timed out on 2 more benchmarks
compared to INC. It is worth emphasizing that log-space

Benchmark | Tool | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 | Total | Speed
or-50-5-5-UC-10 | WAPS | 56.6 | 56.3 | 52.5 | 59.4 | 52.5 | 53.6 | 59.4 | 53.2 | 53.4 | 61.7 | 558.6 | 1.0x
(100, 253) | INC | 1461.3 | 7.6 | 8.4 | 8.4 | 8.4 | 8.4 | 8.5 | 8.5 | 8.4 | 8.5 | 1536.3 | 0.4x
or-100-20-9-UC-30 | WAPS | 73.0 | 69.1 | 66.7 | 76.0 | 66.5 | 66.9 | 76.6 | 66.0 | 66.9 | 78.6 | 706.1 | 1.0x
(200, 528) | INC | 269.5 | 4.7 | 4.8 | 4.8 | 4.9 | 5.1 | 4.8 | 4.8 | 4.8 | 5.1 | 313.4 | 2.3x
s953a_15_7 | WAPS | 1.7 | 1.1 | 1.1 | 1.2 | 1.0 | 1.1 | 1.2 | 1.1 | 1.1 | 1.3 | 11.9 | 1.0x
(602, 1657) | INC | 4.9 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 0.7 | 11.5 | 1.0x
h8max | WAPS | 90.3 | 104.2 | 92.4 | 116.0 | 94.3 | 94.1 | 112.9 | 92.9 | 94.4 | 120.4 | 1011.9 | 1.0x
(1202, 3072) | INC | 34.1 | 2.1 | 2.2 | 2.4 | 2.3 | 2.4 | 2.2 | 2.4 | 2.4 | 2.3 | 55.7 | 18.2x
innovator | WAPS | 195.5 | 221.9 | 201.3 | 244.4 | 200.1 | 206.7 | 247.2 | 202.0 | 202.9 | 257.4 | 2179.3 | 1.0x
(1256, 50452) | INC | 32.8 | 1.6 | 1.8 | 1.9 | 1.9 | 1.9 | 1.8 | 1.9 | 1.9 | 1.9 | 49.4 | 44.1x


TABLE II: Runtime (seconds) breakdowns for each of ten rounds (R1-R10) between WAPS and INC for benchmarks of
different sizes, e.g. the 'h8max' benchmark consists of 1202 variables and 3072 clauses.
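
The Speed column of Table II appears to be the ratio of the WAPS total to the INC total, rounded to one decimal place. The short Python sketch below recomputes it from the totals reported in the table; the dictionary layout is our own.

```python
# (WAPS total, INC total) in seconds, copied from the Total column of Table II.
totals = {
    "or-50-5-5-UC-10": (558.6, 1536.3),
    "or-100-20-9-UC-30": (706.1, 313.4),
    "s953a_15_7": (11.9, 11.5),
    "h8max": (1011.9, 55.7),
    "innovator": (2179.3, 49.4),
}

# Speedup of INC over WAPS: values above 1 mean INC is faster overall.
speedup = {name: round(waps / inc, 1) for name, (waps, inc) in totals.items()}
```

The recomputed values (0.4, 2.3, 1.0, 18.2 and 44.1) agree with the Speed entries of the INC rows.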


Number of rounds | WAPS | INC 
10 | 674 | 650 
20 | 596 | 649 


TABLE III: Number of completed benchmarks within 3600s,
for 10 and 20 round settings 


Statistic | d-DNNF nodes / PROB nodes
Mean | 18.92
Std | 81.19
Median | 4.64
Max | 1734.08


TABLE IV: Statistics for number of nodes in d-DNNF (WAPS
KC diagram) over that of smoothed PROB (INC KC diagram).


computations do not introduce any error, and our usage of 
them sought to improve on the naive usage of arbitrary 
precision math libraries. 

RQ 4: INC Sampling Quality: We conducted an additional
evaluation to further substantiate INC's sampling
correctness, apart from the theoretical analysis in Section IV-D.
Specifically, we compared the samples from INC and WAPS,
which has proven theoretical guarantees [6], on the 'case110'
benchmark that is extensively used by prior works [4]-[6]. We
gave each positive literal a weight of 0.75 and each negative
literal a weight of 0.25, subsequently drew one million samples
using both INC and WAPS, and compared them in Figure 5.


Statistic | INCap runtime / INC runtime
Mean | 1.14
Std | 0.16
Median | 1.12
Max | 1.89


TABLE V: Runtime comparison of INC and INCap

Fig. 5: Distribution comparison for Case110, with log scale
for both axes


Figure 5 shows the distributions of samples drawn by INC
and WAPS for the 'case110' benchmark. A point (x, y) on the plot
represents y unique solutions that were each sampled x
times in the sampling process by the respective sampler. The
almost perfect match between the weighted samples drawn
by INC and WAPS, coupled with our theoretical analysis in
Section IV-D, substantiates our claim of INC's correctness in
performing weighted sampling. Additionally, it also shows that
INC can be a functional replacement for the existing
state-of-the-art sampler WAPS, given that both have theoretical guarantees.
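
The points plotted in such a comparison are a "count of counts". A minimal Python sketch of that computation follows; the function name and the toy sample list are ours, and real samples would be hashable assignments such as tuples of literals.

```python
from collections import Counter

def count_of_counts(samples):
    """Map x -> y, where y unique solutions were each sampled x times."""
    per_solution = Counter(samples)           # solution -> times sampled
    return Counter(per_solution.values())     # times sampled -> number of solutions

# Toy example: three distinct solutions sampled 3, 2 and 1 times.
samples = [(1, -2), (-1, 2), (1, -2), (1, 2), (1, -2), (-1, 2)]
hist = count_of_counts(samples)
```

Computing this histogram for each sampler and overlaying the two gives a quick visual check that both follow the same distribution.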


Discussion: We demonstrated the runtime performance
advantages of INC and the two main contributing factors: a
choice of succinct knowledge compilation form and dynamic
log-space annotation. INC takes longer than WAPS for
single-round sampling, mainly because WAPS takes less time for
KC diagram compilation than INC. In the incremental sampling
setting, the compilation costs of KC diagrams are amortized,
and since INC is substantially better at handling incremental
updates, it takes the overall runtime lead from WAPS
in the majority of the benchmarks. Extrapolating the trend,
it is likely that INC would have a larger runtime lead
over WAPS for applications requiring more than 10 sampling
rounds. The runtime breakdown demonstrates that INC is
able to amortize the compilation time over the incremental
sampling rounds, with subsequent rounds being much faster
than WAPS. In summary, we show that INC is substantially
better at incremental sampling than the existing state-of-the-art.
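
The amortization argument can be illustrated with a simple linear cost model. The sketch below is our own back-of-the-envelope extrapolation, not a result from the experiments: it assumes that every incremental round costs the same as the R2-R10 average and ignores the timeout. The inputs are taken from the 'or-50-5-5-UC-10' rows of Table II, the one listed benchmark where INC is slower over 10 rounds.

```python
import math

def break_even_round(first_a, incr_a, first_b, incr_b):
    """Smallest round count r at which tool B's cumulative time is no larger
    than tool A's, assuming cumulative(r) = first + (r - 1) * incr.
    Returns None if B never catches up under this model."""
    if first_b <= first_a:
        return 1
    if incr_b >= incr_a:
        return None
    return 1 + math.ceil((first_b - first_a) / (incr_a - incr_b))

# WAPS (tool A) and INC (tool B): R1 time and mean time of rounds R2-R10.
waps_first, waps_incr = 56.6, (558.6 - 56.6) / 9
inc_first, inc_incr = 1461.3, (1536.3 - 1461.3) / 9

rounds_needed = break_even_round(waps_first, waps_incr, inc_first, inc_incr)
```

Under these assumptions the expensive first round of INC on this benchmark would be recovered after a few dozen rounds, which is in line with the trend described above.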


VI. CONCLUSION AND FUTURE WORK 


In conclusion, we introduced a bottom-up weighted sampler, 
INC, that is optimized for incremental weighted sampling. By 
exploiting the succinct structure of PROB and log-space com- 
putations, INC demonstrated superior runtime performance in 
a series of extensive benchmarks when compared to the cur- 
rent state-of-the-art weighted sampler WAPS. The improved 
runtime performance, coupled with correctness guarantees, 
makes a strong case for the wide adoption of INC in future 
applications. 

For future work, a natural step would be to seek further
runtime improvements for PROB compilation, since INC takes
longer than the state of the art for the initial sampling round
due to slower compilation. Another extension would be to investigate the
design of a partial annotation algorithm to reduce computations
when only a small portion of the weights have been
updated. It would also be of interest to store partial
sampled assignments at each node as a succinct sketch to
reduce the memory footprint; for instance, we could store each
unique assignment and its count.

Abstract—Bounded Model Checking (BMC) is an effective and
precise static analysis technique that reduces program verification
to satisfiability (SAT) solving. In this paper, we present the
design and implementation of a new BMC engine (SEABMC)
in the SEAHORN verification framework for LLVM. SEABMC
precisely models arithmetic, pointer, and memory operations of
LLVM. Our key design innovation is to structure verification
condition generation around a series of transformations, starting
with a custom IR (called SEA-IR) that explicitly purifies all
memory operations by explicating dependencies between them.
This transformation-based approach enables supporting many
different styles of verification conditions. To support memory
safety checking, we extend our base approach with fat pointers
and shadow bits of memory to keep track of metadata, such
as the size of a pointed-to object. To evaluate SEABMC, we
have used it to verify the aws-c-common library from AWS. We
report on the effect of different encoding options with different
SMT solvers, and also compare with CBMC, SMACK, KLEE
and SYMBIOTIC. We show that SEABMC is capable of providing
an order of magnitude improvement compared with the state-of-the-art.


I. INTRODUCTION 


Bounded Model Checking (BMC) is an effective technique
for precise software static analysis. It encodes a bounded
(i.e., loop- and recursion-free) program P with assertions
into a verification condition VC in (propositional) logic, such
that VC is satisfiable iff P has an execution that violates
an assertion. The satisfiability of VC is decided by a SAT-solver
(or, more commonly, by an SMT-solver). BMC can
be extremely precise, including path-sensitivity, bit-precision,
and a precise memory model. Its key weakness is scalability:
precise reasoning requires careful selection of what details to
include in the analysis.
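
As a toy illustration of "VC is satisfiable iff P has a violating execution", the Python sketch below encodes a one-input bounded program as a predicate and decides it by exhaustive enumeration. The enumeration merely stands in for a SAT or SMT solver, and the example program (an 8-bit addition followed by a comparison) is ours, not one from the evaluation.

```python
def vc(x):
    """Verification condition for an 8-bit input x:
    the assumption holds and the assertion fails."""
    assumed = 0 < x < 10            # assume(x > 0 && x < 10)
    y = (x + 250) & 0xFF            # 8-bit wrap-around addition
    asserted = y > x                # assert(y > x)
    return assumed and not asserted

# Exhaustive search over all 8-bit inputs plays the role of the solver.
counterexamples = [x for x in range(256) if vc(x)]
```

The VC is satisfiable here: inputs 6 to 9 overflow the 8-bit addition and violate the assertion, which a bit-precise encoding detects and an unbounded-integer encoding would miss.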

A BMC engine can be implemented directly at the level
of program source code, as best illustrated by CBMC [1],
the oldest and most mature BMC for C. This allows verifying
the absence of undefined behaviour and other source-level
properties, and improves error reporting since it can be done at
the source level. However, this complicates the implementation
because modern programming languages are incredibly
complex. Moreover, most industrial code relies on de-facto,
rather than the standard, language semantics [2] and on
non-standard features that are supported by mainstream compilers.
An alternative is to implement BMC on an intermediate 
representation (IR) of a compiler. LLVM IR [3], called bitcode, 
is a common choice. This simplifies implementation to focus 
only on capturing semantics of the IR, allows sharing infras- 
tructure with the compiler, simplifies integration of verification 
into current build systems, and simplifies supporting multiple 
source languages (e.g., SMACK [4] supports 8 languages [5]). 
This is the approach we take in this paper. 

Over the years, there have been multiple BMC tools de- 
veloped for LLVM, including SEAHORN (that we build on), 
SMACK, and LLBMC [6]. However, the issue still remains 
that existing tools are either not maintained, commercial (and 
not publicly available, e.g. LLBMC), or are not effective at 
bit- and memory-precise reasoning (SEAHORN and SMACK). 
Our goal is to address this deficiency, while re-examining 
and re-evaluating many of the design decisions. Thus, while 
BMC is a mature technique, we have two objectives. First, we 
want different strategies for generating verification conditions 
(VCGen) through program transformations. This allows us to 
examine which encoding works best in practice for production 
code, and why. Second, we want to provide mechanisms 
to express safety properties, e.g. memory safety, succinctly. 
In accomplishing these objectives, we believe that we have 
identified a new interesting point in the design space. 

For our first objective, we propose a new pipeline. A 
source program is translated to a new IR, called SEA-IR, that 
extends LLVM IR, with explicit dependency between memory 
operations. This, effectively, purifies memory operations, i.e., 
there is no global memory, and no side-effects. A SEA-IR 
program then goes through a series of program transformations 
for VCGen. The program is progressively reduced to a pure
data-flow form in which all instructions execute in parallel,
and only then is it converted to SMT-LIB supported logic. This
allows experimenting with different strategies of VCGen by 
controlling these transformations. For example, we can gen- 
erate VCs using a control flow representation of the program 
like DAFNY [7] or a pure data flow representation like CBMC. 
VCs depend on memory representation. Thus, we explore
two different forms of representing memory content:
lambda-based [8], which represents memory as nested ITE-expressions¹,
and array-based, which uses the SMT theory of arrays [9]. In
particular, the lambda-based representation allows precise and efficient
modelling of wide memory operations such as memcpy. We
also explore the space of memory models between a flat memory,
in which memory is a flat array, and an object memory, where
memory is represented by a set of arrays.
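
The idea behind a lambda-based memory can be sketched with Python closures: a memory is a function from addresses to values, and each store wraps the previous memory in one more if-then-else. This is only an analogy for the SMT encoding; in particular, the `memcpy` below, which covers a whole address range with a single ITE, is our assumption about why wide operations can be modelled compactly.

```python
def empty_memory(default=0):
    """A memory that maps every address to the same default value."""
    return lambda addr: default

def store(mem, addr, value):
    """New memory: ITE(a == addr, value, mem(a)). The old memory is untouched."""
    return lambda a: value if a == addr else mem(a)

def memcpy(mem_dst, mem_src, dst, src, n):
    """New memory where [dst, dst + n) mirrors [src, src + n) of mem_src."""
    return lambda a: mem_src(src + (a - dst)) if dst <= a < dst + n else mem_dst(a)

m0 = empty_memory()
m1 = store(m0, 16, 7)             # write 7 at address 16
m2 = memcpy(m1, m1, 32, 16, 2)    # copy two cells from 16.. to 32..
```

Each operation returns a fresh memory, so earlier memories such as `m0` remain valid values, which matches the pure treatment of memory described above.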

To improve checking for safety properties, our second
objective, we attach additional information to pointers
(so-called fat pointers) and to memory (so-called shadow memory). This simplifies
tracking of various program metadata for modelling safety 
properties. As an example, we can use fat pointers to check for 
out of bounds array access and shadow memory to check for 
immutability of read only memory. While existing tools report 
memory safety analysis, SEABMC can capture metadata of 
arbitrary size since we are not constrained by concrete pointer 
or memory width. Additionally, we model pointer provenance. 
This allows us to catch out-of-bounds accesses which might 
be missed by tools like LLBMC and ASAN [10]. 
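
A hedged Python sketch of the fat-pointer idea: each pointer carries the base and size of the object it was derived from, so a bounds check depends on provenance rather than on whether the target address happens to be allocated. The field names and layout are illustrative and are not SEABMC's actual encoding.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FatPtr:
    addr: int   # current address
    base: int   # start address of the object the pointer was derived from
    size: int   # size of that object in bytes

def offset(p, k):
    """Pointer arithmetic keeps the provenance metadata unchanged."""
    return FatPtr(p.addr + k, p.base, p.size)

def in_bounds(p, width=1):
    """True if an access of `width` bytes at p stays inside p's own object."""
    return p.base <= p.addr and p.addr + width <= p.base + p.size

one_byte = FatPtr(addr=100, base=100, size=1)
two_bytes = FatPtr(addr=200, base=200, size=2)
```

With this metadata, `offset(one_byte, 1)` is flagged as out of bounds even if address 101 belongs to some other live object.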

We evaluate SEABMC on verification tasks of the
aws-c-common C library developed by Amazon Web
Services (AWS). The library is a collection of common 
data-structures for C (including buffers, arrays, lists, etc.). 
We chose it for several reasons. First of all, it has been 
recently verified using CBMC. Thus, it includes many 
meaningful verification tasks. Second, it is a live industrial 
project, thus, it provides an example of how to integrate 
SEABMC into a real project, and shows that SEABMC 
supports all of the necessary language features. Third, it 
provides an opportunity to compare head-to-head against a 
mature tool (CBMC) on industrial code. We feel this is a 
more interesting comparison than, for example, comparing 
on isolated verification benchmarks of SVCOMP [11]. We 
show that SEABMC is an order of magnitude faster than 
CBMC, and outperforms three mature LLVM-based tools: 
SMACK, SYMBIOTIC [12] and KLEE [13]. Note that we 
focus on SEABMC design and performance. An extensive 
case study comparing different kinds of verification tools on 
aws-c-common is available in [14]. 

In summary, this paper makes the following contributions: 
an IR, SEA-IR, for LLVM bitcode that purifies memory 
operations; a VCGen that combines program transformations 
with encoding into logic allowing for many different styles 
of VCs; a memory model that combines fat-pointers with 
shadow-memory to represent metadata; an open-sourced BMC 
tool; and, a thorough evaluation against the state-of-the-art 
verification tools on production C code. 


II. GENERATING VERIFICATION CONDITIONS 


This section presents our main verification condition generation
(VCGen) algorithm. We start with a new intermediate
representation, that we call SEA-IR. This representation extends
LLVM bitcode with purified memory operations. We
then describe a series of transformations that transform a
program in SEA-IR to a pure data-flow (PD) form where no
part of computation depends on control. Each transformation
progressively simplifies the program for generating verification
conditions. The PD form is one from which verification conditions
can be generated in the most straightforward way. Finally,
we show how PD programs can be converted to verification
conditions in SMT-LIB. In this section, we assume that
the input program contains only one function, and no loops or
global variables. In practice, this is achieved by inlining all
functions, unrolling loops to a fixed depth, and eliminating
global variables. The loop unroll bound is often detected
automatically, but can also be set by the user.


¹ITE stands for If-Then-Else.

SEA-IR. SEABMC transforms LLVM bitcode to an intermediate
representation, called SEA-IR, that extends LLVM
bitcode by making dependency information between memory
operations explicit. In LLVM IR, this information does not
exist in the program. Fig. 1 shows the simplified syntax of
SEA-IR. Here, we present a simplified version with many
features removed, e.g., types, expressions, function calls, etc.
However, we assume that the type of each register is known
(but not shown). We use R to represent a scalar register, P for a
pointer register and M for a memory register. A legal SEA-IR
program is assumed to be in Static Single Assignment (SSA)
form, with all registers assigned before use, all expressions
well-typed, and the program always ending with a halt.

We use the term object to refer to an allocated sequence 
of bytes in memory. Interestingly, we do not use a single 
addressable memory that maps from addresses to values. 
Instead, a SEA-IR program uses a set of memory regions,
or memories, which collectively contain all objects in a
program. Each memory, in turn, contains a subset of the objects
used in the program. To maintain compatibility with de-facto
semantics, addresses are assigned from a single address space
and are, thus, globally unique. To aid program analysis, all
memories are pure: storing in a memory creates a new memory
(i.e., a definition); loading from a memory is a use. This def-use
scheme [15] is known as MemorySSA in LLVM. Partitioning
memory into multiple memories relieves the SMT-solver from 
some of the alias analysis reasoning. 

To explain SEA-IR, we use a simple C program in Fig. 2.
The program initializes variable x with a non-deterministic
8-bit integer obtained as the return value of function nd_char().
The value of x is further constrained by the assume, such that
x > 0 && x < 10. Then, the program non-deterministically
allocates a 1- or 2-byte memory region and assigns the address
to the variable p. The first byte that p points to is assigned
the value of x. The second byte (if any) is assigned 0.
For the moment, ignore that the second assignment might be
undefined behaviour (we expand on this in Sec. III). Finally,
the two asserts describe the post-condition.

Fig. 3a shows the SEA-IR program transformed from the
C program. In this presentation, we do not strictly follow the
syntax of SEA-IR. For example, we allow immediate values
to appear in place of registers, and write expressions in infix
form. The program is a single function main, which consists
of four basic blocks labeled by BB0, BB1, BB2 and BB3. A basic
block consists of a label, zero or more PHI-statements, one or
more statements, and a terminating branch statement or halt.


PR   ::= fun main() {BB+}
BB   ::= L: PHI* S* (BR | halt)
BR   ::= br R, L, L | br L
PHI  ::= R = phi [R, L](, [R, L])* |
         M = phi [M, L](, [M, L])* |
         P = phi [P, L](, [P, L])*
S    ::= RDEF | MDEF | VS
RDEF ::= R = E | P, M = alloca R, M |
         P, M = malloc R, M | R = load P, M |
         P = load P, M | M = free P, M
MDEF ::= M = store R, P, M | M = store P, P, M
VS   ::= assert R | assume R


Fig. 1: Simplified grammar of SEA-IR, where E, L, R, P and M
are expressions, labels, scalar registers, pointer registers and memory
registers, respectively.

A SEA-IR program has three types of registers: scalar
registers, pointer registers and memory registers. Scalar registers
store values of the basic datatype, i.e., integers. Pointer registers store
pointer values. Memory registers store memory regions, and
map from addresses to values. Each memory register maps to
a unique memory and we use memory register and memory
interchangeably. For example, in Fig. 3a, R0 is a scalar register
which stores an integer and P1 is a register for storing a pointer.
M0 and M1 are memory registers. Since each program is finite,
the number of registers is finite as well.

An assignment statement defines the register by the value
of a given expression. We assume that expressions include the
usual set of operations, e.g., arithmetic, bitwise operations,
cast operations and pointer arithmetic. For example, in BB0 of
Fig. 3a, R2 = R0 < 10 defines the value of register R2 by the
value of the expression R0 < 10, where < is an unsigned 8-bit
less-than operator.

A phi selects a value from a list of values when control
flow merges. For example, M3 = phi [M1,BB1], [M2,BB2] in
BB3 of Fig. 3a assigns M1 (M2) to M3 if the previously executed
basic block was BB1 (BB2).

SEA-IR provides alloca and malloc instructions to allocate
memory on the stack and the heap, respectively. A given
number of bytes are allocated in the memory on the RHS of the
statement, defining a new memory on the LHS. While the
allocation does not change memory, it does define it. This is
explained in Sec. III. Consider P1, M1 = malloc 2, M0 in BB1
of Fig. 3a. It allocates 2 bytes (on the heap) in memory M0,
and defines memory M1 and a fresh pointer in P1.

A store, e.g., M5 = store 0, P5, M4 in BB3, defines memory
M5 by writing the value 0 to the address pointed to by the
pointer register P5 in memory M4. Note that the instruction
is pure; i.e., all effects of the instruction are on the output
registers only. The result of the modification is in M5, while M4
is unchanged. Similarly, a load reads the value pointed to by
a pointer register in a memory register M, and assigns the value
to a new register. assert and assume are the usual verification
statements for assertions and assumptions, respectively.


int main() {
  uint8_t x = nd_char();
  assume(x > 0 && x < 10);

  uint8_t *p = nd_bool() ? malloc(2 * sizeof(uint8_t))
                         : malloc(sizeof(uint8_t));
  *p = x;
  *(p + 1) = 0;

  assert(0 < *p && *p < 10);
  assert(*(p + 1) == 0); // potential UB
  return 0;
}


Fig. 2: An example C program. 


Program Transformation Before generating verification con- 
ditions, a series of program transformations, as given below, 
are applied to a SEA-IR program. 
Single Assert Form. A program is in a Single Assert (SA) 
form if it only contains one assert, which appears as the 
last instruction (before halt) in the last block of a program. 
Fig. 3b shows the code in a SA form transformed from 
the one in Fig. 3a, where an ERR label is added to the 
original code, and denotes an error state. In BB3, assert R6 iS 
transformed into br R6, BB4, ERR, meaning that if R6 is false, 
then the program’s execution trace is diverted to ERR. Similarly, 
assert 0 = 0 in BB3 is transformed into assume 0 != 0 and 
br ERR. 
Single Assume Single Assert (SASA) Form. A program is in 
SASA form if it is in SA form, and contains a single assume 
immediately followed by a single assert. For example, the 
two definition of registers R1 and R2 in BBO of Fig. 3b are 
combined into one definition of R1 in Fig. 3c, where the two 
boolean expressions are combined by a conjunction. A phi- 
statement, A = phi [R6,BB4], [R1,BB3], is added to ERR, so 
that register a tracks the value of the conjunction. The assume 
ensures that a is true prior to the assertion. 
Gated Single Static Assignment Form. A program in SASA 
form is further transformed into a Gated Single Static As- 
signment (GSSA) form, where phi-functions are replaced by 
select expressions”. For example, phi [M1,BB1], [M2,BB2] 
in ERR Of Fig. 3c is transformed into select R2, M1, M2 in 
Fig. 3d, where R2 is the condition that the program trace is 
diverted to BB1 or BB2. 
Pure Dataflow Form. A (loop-free) program is in a Pure 
Dataflow (PD) form if it is in GSSA form and contains a 
single basic block. As shown in Fig. 4a, all the labels and 
br are removed from Fig. 3d, and the five basic blocks are 
merged into one single basic block. 
Reduced Pure Dataflow Form. A program is in a reduced PD 
form if every definition appears on a def-use chain of either 
assume Or assert. Each such definition is said to be in the 
cone of influence (COI). In Fig. 4a, the highlighted code is 
not in the cone of influence and is not considered. 
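
The cone-of-influence reduction can be illustrated with a small backward traversal over def-use chains. The following is a minimal sketch, not the SEABMC implementation: the statement encoding as (defined names, used names) pairs and the toy program are assumptions made for illustration only.

```python
def cone_of_influence(defs, roots):
    """Return the indices of definitions on a def-use chain ending in `roots`.

    `defs` is a list of (defined_names, used_names) pairs, one per statement.
    `roots` are the registers read by assume/assert.
    """
    # Map each register to the statement that defines it (SSA: one definition).
    producer = {name: idx for idx, (defined, _) in enumerate(defs) for name in defined}
    keep, worklist = set(), list(roots)
    while worklist:
        idx = producer.get(worklist.pop())
        if idx is None or idx in keep:
            continue  # undefined input, or statement already visited
        keep.add(idx)
        worklist.extend(defs[idx][1])  # follow the uses backwards
    return keep


# Hypothetical single-block program, not taken from the paper:
TOY = [
    (("X",), ()),       # 0: X = nd_char()
    (("Y",), ("X",)),   # 1: Y = X + 1
    (("Z",), ("X",)),   # 2: Z = X * 2   (never used by assume/assert)
    (("C",), ("Y",)),   # 3: C = Y > 0
]
REDUCED = sorted(cone_of_influence(TOY, roots=["C"]))  # assume/assert read C
```

Here statement 2 is dropped because no assume or assert depends on Z, directly or indirectly.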

A reduced PD program has no control dependencies. It is 
essentially a sequence of equations with two side-conditions 
determined by assume and assert. All definitions are used, 


² In LLVM, select is the usual ternary ITE, such as a ? c : b in C.

    (a) SEA-IR
    fun main() {
    BB0:
      M0 = mem.init()
      R0 = nd_char()
      R1 = R0 > 0
      assume R1
      R2 = R0 < 10
      assume R2
      R3 = nd_bool()
      br R3, BB1, BB2
    BB1:
      P1, M1 = malloc 2, M0
      br BB3
    BB2:
      P2, M2 = malloc 1, M0
      br BB3
    BB3:
      M3 = phi [M1,BB1], [M2,BB2]
      P4 = phi [P1,BB1], [P2,BB2]
      M4 = store R0, P4, M3
      P5 = P4 + 1
      M5 = store 0, P5, M4
      R6 = R0 > 0 && R0 < 10
      assert R6
      assert 0 == 0
      halt
    }

    (b) Single Assert (SA)
    fun main() {
    BB0:
      M0 = mem.init()
      R0 = nd_char()
      R1 = R0 > 0
      assume R1
      R2 = R0 < 10
      assume R2
      R3 = nd_bool()
      br R3, BB1, BB2
    BB1:
      P1, M1 = malloc 2, M0
      br BB3
    BB2:
      P2, M2 = malloc 1, M0
      br BB3
    BB3:
      M3 = phi [M1,BB1], [M2,BB2]
      P4 = phi [P1,BB1], [P2,BB2]
      M4 = store R0, P4, M3
      P5 = P4 + 1
      M5 = store 0, P5, M4
      R6 = R0 > 0 && R0 < 10
      br R6, BB4, ERR
    BB4:
      assume 0 != 0
      br ERR
    ERR:
      assert false
      halt
    }

    (c) Single Assume (SASA)
    fun main() {
    BB0:
      M0 = mem.init()
      R0 = nd_char()
      R1 = R0 > 0 && R0 < 10
      R2 = nd_bool()
      br R2, BB1, BB2
    BB1:
      P1, M1 = malloc 2, M0
      br BB3
    BB2:
      P2, M2 = malloc 1, M0
      br BB3
    BB3:
      M3 = phi [M1,BB1], [M2,BB2]
      P3 = phi [P1,BB1], [P2,BB2]
      M4 = store R0, P3, M3
      P4 = P3 + 1
      M5 = store 0, P4, M4
      R5 = R0 > 0 && R0 < 10
      br R5, BB4, ERR
    BB4:
      R6 = false
      br ERR
    ERR:
      A = phi [R6,BB4], [R1,BB3]
      assume A
      assert false
      halt
    }

    (d) Gated SSA (GSSA)
    fun main() {
    BB0:
      M0 = mem.init()
      R0 = nd_char()
      R1 = R0 > 0 && R0 < 10
      R2 = nd_bool()
      br R2, BB1, BB2
    BB1:
      P1, M1 = malloc 2, M0
      br BB3
    BB2:
      P2, M2 = malloc 1, M0
      br BB3
    BB3:
      M3 = select R2, M1, M2
      P3 = select R2, P1, P2
      M4 = store R0, P3, M3
      P4 = P3 + 1
      M5 = store 0, P4, M4
      R5 = R0 > 0 && R0 < 10
      br R5, BB4, ERR
    BB4:
      R6 = false
      br ERR
    ERR:
      A = select R5, R6, R1
      assume A
      assert false
      halt
    }

Fig. 3: Program from Fig. 2 in: (a) SEA-IR, (b) SA, (c) SASA, and (d) GSSA forms.


    fun main() {
    entry:
      M0 = mem.init()
      R1 = R0 > 0 && R0 < 10        $r_1 = (0 < r_0 \land r_0 < 10) \land$
      R2 = nd_bool()
      P1, M1 = malloc 2, M0         $p_1 = addr_0 \land m_1 = m_0 \land$
      P2, M2 = malloc 1, M0         $p_2 = addr_0 + 4 \land m_2 = m_0 \land$
      P3 = select R2, P1, P2        $p_3 = ite(r_2, p_1, p_2) \land$
      M4 = store R0, P3, M3
      P4 = P3 + 1                   $p_4 = p_3 + 1 \land$
      M5 = store 0, P4, M4
      R5 = R0 > 0 && R0 < 10        $r_5 = (r_0 > 0 \land r_0 < 10) \land$
      R6 = false                    $r_6 = false \land$
      A = select R5, R6, R1         $a = ite(r_5, r_6, r_1) \land$
      assume A                      $a \land$
      assert 0                      $\neg false$
      halt
    }
    (a) Pure-Dataflow (PD)          (b) SMT-LIB

Fig. 4: Program from Fig. 2 in PD and SMT-LIB forms. The
highlighted lines are removed from the program.


directly, or indirectly, by either assume or assert (or both). 
Now, generating VC implies mapping each definition into a 
logic equation. 


Verification Condition Generation We now describe the 
translation function sym that encodes a program into a VC. 
Throughout the section, we illustrate sym using the program 
in Fig. 4a and the corresponding VC in Fig. 4b. 


The input to sym is a SEA-IR program in a reduced PD
form, and the output is an SMT-LIB program. For simplicity
of presentation, we assume that two fundamental sorts are
used in the encoding: bit-vector of 64 bits, $bv(64)$, and a map
between bit-vectors, $bv(64) \to bv(64)$.³ In addition, we use
the following helper sorts: $scalr : bv(64)$, $ptrs : scalr$, and
$mems : bv(64) \to bv(64)$, where $scalr$ is the sort of scalars,
$ptrs$ of pointers, and $mems$ of memories.

sym is defined recursively, bottom up, on the abstract
syntax tree of SEA-IR. First, each register, R, is mapped to a
symbolic constant sym(R) of an appropriate sort. To simplify
the presentation, we use a lower-case math font for constants
corresponding to the register. For example, in Fig. 4a, sym(R0)
is $r_0$ of $scalr$ sort, sym(P2) is $p_2$ of $ptrs$ sort, and sym(M0)
is $m_0$ of $mems$ sort, respectively.

Second, each expression E in SEA-IR is mapped into a
corresponding SMT-LIB expression sym(E). We omit the
details of this step since they are fairly standard. For example,
a select is translated into an ite, scalar addition, such as
R0 + 1, is translated into bit-vector addition bvadd, etc. Pointer
manipulating expressions, such as pointer arithmetic (gep) and
pointer-to-integer cast (ptoi), are described in Sec. III.

Finally, sym translates each statement into an equality. For
example, R = E is translated into $r = e$, where $e$ is sym(E).
For example, in Fig. 4a, A = select R5,R6,R1 is translated
into $a = ite(r_5, r_6, r_1)$ in Fig. 4b.

Translating alloca and malloc requires a memory allocator.
We parameterize sym by an allocation function $alloc : A \to ptrs$
that maps allocation expressions in $A$ to values of pointer
sort. For example, in Fig. 5, P1, M1 = alloca R0, M0 is
translated into $p_1 = alloc(\texttt{alloca R0, M0}) \land m_1 = m_0$, and is
reduced to $p_1 = addr_0 \land m_1 = m_0$, where $addr_0$ is the return
value of alloc.


³ In practice, SEABMC supports multiple bit-widths for scalars, and different
ranges for values for maps.

    $sym(\texttt{R = E}) \triangleq r = e$    $sym(\texttt{assume R}) \triangleq r$    $sym(\texttt{assert R}) \triangleq \neg r$
    $sym(\texttt{M1 = store R1,P2,M0}) \triangleq m_1 = write(m_0, r_1, p_2)$
    $sym(\texttt{R1 = load P0,M}) \triangleq r_1 = read(m, p_0)$
    $sym(\texttt{P1,M1 = alloca R0,M0}) \triangleq p_1 = alloc(\texttt{alloca R0,M0}) \land m_1 = m_0$
    $sym(\texttt{P1,M1 = malloc R0,M0}) \triangleq p_1 = alloc(\texttt{malloc R0,M0}) \land m_1 = m_0$

Fig. 5: Definition of sym.


    $\forall a \in A \cdot size(a)$ is known        $\forall a \in A \cdot (alloc(a) \bmod align(a)) = 0$
    $\forall a_1 \neq a_2 \in A \cdot (alloc(a_1) + size(a_1) < alloc(a_2)) \lor (alloc(a_2) + size(a_2) < alloc(a_1))$

Fig. 6: Specifications for size, align, and alloc.


For sym, alloc must satisfy the basic specifications of a 
memory allocator. The spec is formalized in Fig. 6, where 
size and align return the size and alignment of each allocation 
expression in A. Intuitively, each allocated segment must have 
a statically known bound on size, all pointers returned by 
an allocation are aligned, and all allocations are mutually 
disjoint. For example, in Fig. 4a, the memory allocations
in P1, M1 = malloc 2, M0 and P2, M2 = malloc 1, M0 are
guaranteed to be disjoint since Fig. 4b adds a constraint that
$p_1 = addr_0 \land p_2 = addr_0 + 4$. In practice, we also enforce
that stack allocations (alloca) return high addresses, and heap 
allocations (malloc) return low addresses. Other constraints, 
such as separating kernel- and user-space addresses can be 
easily added. 
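
To make the allocator specification concrete, the following is a minimal sketch of a bump allocator that satisfies the alignment and disjointness conditions of Fig. 6. It is an illustration under stated assumptions, not the allocator used by SEABMC: the start address 0x1000, the fixed alignment of 4, and the helper names make_allocator and satisfies_spec are all hypothetical.

```python
def make_allocator(start=0x1000, align=4):
    """Return (alloc, table): a bump allocator keyed by allocation expression."""
    table = {}            # allocation expression -> (base address, size)
    next_free = start

    def alloc(expr, size):
        nonlocal next_free
        if expr not in table:                      # one address per expression
            base = -(-next_free // align) * align  # round up to the alignment
            table[expr] = (base, size)
            next_free = base + size + 1            # gap keeps the bound strict
        return table[expr][0]

    return alloc, table


def satisfies_spec(table, align=4):
    """Check the alignment and disjointness conditions of Fig. 6."""
    items = list(table.values())
    aligned = all(base % align == 0 for base, _ in items)
    disjoint = all(
        b1 + s1 < b2 or b2 + s2 < b1
        for i, (b1, s1) in enumerate(items)
        for (b2, s2) in items[i + 1:]
    )
    return aligned and disjoint


alloc, table = make_allocator()
p1 = alloc("P1, M1 = malloc 2, M0", 2)
p2 = alloc("P2, M2 = malloc 1, M0", 1)
```

With these assumed parameters the second allocation lands four bytes after the first, which mirrors the constraint on the two malloc results in Fig. 4b.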

The semantics for memory operations depends on the representation
of memories (see Sec. II). We use two functions,
read and write, to encapsulate the actual translation when
defining the meaning of load and store, respectively. The
function $read(m, p)$ represents the value of the memory register
$m$ at index $p$. The function $write(m, r_1, p_0)$ represents a
new memory obtained by writing the value $r_1$ at index $p_0$ in
$m$. In Fig. 5, load P0, M and store R1, P2, M0 are translated
into $read(m, p_0)$ and $write(m_0, r_1, p_2)$, respectively.

SEABMC has two memory representations: Arrays and 
Lambdas. 

Arrays. Memories are modeled by an SMT-LIB theory of
extensional arrays ArraysEx⁴. A memory register M is mapped
to a symbolic constant $m$, where $m$ is of sort $mems$. As shown
in Fig. 7, a write is translated into an ArraysEx store, and a
read is translated into an ArraysEx select.

Lambdas. Memories are modelled by $\lambda$-functions of the form
$\lambda x.e$, where $e$ is an expression with free occurrences of $x$. A
memory register M is translated into an uninterpreted function
$m$ of sort $mems$. As shown in Fig. 7, $read(m, p_0)$ is translated
into a function application $m(p_0)$, and $write(m_0, r_1, p_2)$ is
translated into a new $\lambda$-function, $\lambda x.ite(x = p_2, r_1, m_0(x))$. In
the final VC, function applications are $\beta$-reduced to substitute
formal arguments with actual parameters. Thus, the VC only


⁴ http://smtlib.cs.uiowa.edu/theories-ArraysEx.shtml.


                             Array              $\lambda$
    $read(m, p_0)$           select m p0        $m(p_0)$
    $write(m_0, r_1, p_2)$   store m0 r1 p2     $\lambda x.ite(x = p_2, r_1, m_0(x))$

Fig. 7: Translation of read and write.


RDEF ::= R = isderef R, R | R = isalloc R, M | R = ismod R, M


Fig. 8: SEA-IR syntax for memory safety. 


has ites, and does not require ArraysEx support in the SMT-solver.
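
The Lambdas representation can be mimicked with ordinary closures. The sketch below is only an analogy to the encoding in Fig. 7: a memory is a function from addresses to values, and a write wraps the previous memory in a new function. Initializing memory to a default of 0 is an assumption made here so that the example is executable; the names mem_init, write, and read are chosen for illustration.

```python
def mem_init(default=0):
    """Initial memory: every address maps to `default` (an assumption)."""
    return lambda addr: default


def write(mem, value, ptr):
    """Return a new memory, i.e. lambda x. ite(x = ptr, value, mem(x))."""
    return lambda x: value if x == ptr else mem(x)


def read(mem, ptr):
    """A read is just function application, mem(ptr)."""
    return mem(ptr)


m0 = mem_init()
m1 = write(m0, 7, 0x10)
m2 = write(m1, 9, 0x11)
```

Reading m2 at 0x10 unfolds the nested conditionals until the matching write is found, which is the chain of ite terms that remains in the VC after reduction. Older memories such as m1 are unchanged by later writes.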

Overall, for a program P in a reduced PD form with a
sequence of statements $s_0, \ldots, s_k$, followed by assume R0 and
assert R1, sym(P) is defined as follows:

$$sym(P) = \Big(\bigwedge_{0 \le i \le k} sym(s_i)\Big) \land sym(\texttt{assume R0}) \land sym(\texttt{assert R1}).$$

For example, the VC for the program in Fig. 4a is shown in
Fig. 4b. Definitions in Fig. 4a are translated into a conjunction
of equalities, and assert 0 is translated into $\neg false$. The VC
is unsatisfiable since $a$ evaluates to false.
Theorem 1: sym(P) is satisfiable iff P has an execution that
satisfies the assumption and violates the assertion.
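
The unsatisfiability claim for Fig. 4b can be checked by brute force over the free constants. This is a minimal sketch rather than an SMT query: it assumes that nd_char() ranges over signed 8-bit values, it omits the pointer equalities because they only name fresh constants, and the second function is a hypothetical variant added to show what a satisfiable VC looks like.

```python
from itertools import product


def vc_fig4b(r0, r2):
    """Evaluate the Fig. 4b conjunction for one choice of the free constants.

    The pointer equalities (p1..p4) are always satisfiable and are left out.
    r2 is kept as an argument to show that the result does not depend on it.
    """
    r1 = (0 < r0) and (r0 < 10)
    r5 = (r0 > 0) and (r0 < 10)
    r6 = False
    a = r6 if r5 else r1          # a = ite(r5, r6, r1)
    return a and (not False)      # sym(assume A) and sym(assert 0)


def vc_without_assume(r0, r2):
    """Hypothetical variant: no assumption, only the negated assert R5."""
    r5 = (r0 > 0) and (r0 < 10)
    return not r5


def is_satisfiable(vc):
    # Assumption: nd_char() yields a signed 8-bit value.
    return any(vc(r0, r2) for r0, r2 in product(range(-128, 128), (False, True)))
```

The first VC has no model because a is false on both branches of the ite once the assumption and the assertion test the same condition. The variant has models (for example r0 = 0), which by Theorem 1 would correspond to executions violating the assertion.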


III. VERIFYING MEMORY SAFETY 


In most languages, including C, memory safety is difficult to 
specify directly. To make such specifications possible, we use 
fat pointers [16] and shadow memory to keep metadata about 
pointers and memory, respectively. Moreover, we present a 
general extension of both memory and pointer semantics. 

Intuitively, we want to represent each fat pointer as a tuple 
of values that collectively represent the value of the pointer 
and all the metadata (i.e., fat) that is cached at it. We do 
not put restrictions on the number of values nor their sorts. 
However, we assume that there is a function addr that maps 
a pointer to an expression representing an address. Thus, 
for a pointer register P, sym(P) is a tuple $(t_1,\ldots,t_j)$ of $j$
constants that represents the pointer, and $addr((t_1,\ldots,t_j))$ is
an address of that pointer. For example, a common case is
to use the first element of the tuple to represent the address:
$addr((t_1,\ldots,t_j)) = t_1$. Fig. 13 presents a small program (on
the left) that writes a fat pointer P0 to memory at address
P1. Memory is divided into five parts with val memory used 
to store the actual program data. Here, val stores the base 
value of the fat pointer and offset and size store the fat. 
Memory operations are tracked by alloc and mod memory 
that mark whether an address is allocated and whether it has 
been written to, respectively. Fig. 13 shows the memory state 
after the store operation. Both alloc and mod are set to 1 
because P1 is allocated and has been modified. 

Formally, we re-define ptrs to be a tuple of sorts, written
as $(s_1,\ldots,s_j)$. We say that a tuple $\tau = (c_1,\ldots,c_p)$ of $p$
constants is of a tuple sort $(s_1,\ldots,s_p)$ iff, for each $0 < i \le p$,
$c_i$ is of sort $s_i$. Tuples of sorts, and tuples of constants are
only present during VCGen, but not in the final verification
condition. For that, we rewrite equality between two tuples as
conjunction of equalities between their elements, and use $\tau.i$
for the $i$th element of tuple $\tau$.

Similarly, we re-define mems for a memory register M to
be a tuple of values that store the program and the shadow
states. Thus, $sym(\texttt{M}) = (v_1,\ldots,v_k)$, where each $v_i$ is of sort
$bv(64) \to bv(64)$. If a pointer is represented by a $j$-tuple, we
assume that memory is represented by a $k$-tuple, with $k \ge j$,
so that the first $j$ entries in a memory register are wide enough
to store the fat pointer. Specifically, we require that the sort
of $v_i$ is same as sort of $t_i$ for $1 \le i \le j$.

We modify the semantics of malloc by storing metadata
along with explicit program states. The modification is defined
in Fig. 10 ($m_1$ is now a memory tuple). The signature of alloc
is unchanged, but it now returns a fat pointer. Given a pointer
$p$ of sort ptrs, a function $size((t_1,\ldots,t_j))$ returns the size
of a memory object pointed-to by $p$. An additional function
$alloc_{sn} : mems \to mems$ operates on shadow memory. The
semantics of $alloc_{sn}$ and $free_{sn}$ is described later.

A store is divided into two parts. First is the store of
the actual program data. Since the data can be of sort scalr
or ptrs, a store of a $j$-tuple of data on memory $m_0$ is
translated into $j$ writes, one on each element of $(m_0.1,\ldots,m_0.j)$.
Second is updating metadata, done by $store_{sn}$, which works
on $(m_0.(j+1),\ldots,m_0.k)$. The details of $store_{sn}$ are described
later in this section. Similarly, a load expects to read
$(m_0.1,\ldots,m_0.j)$ of sort ptrs. This allows representing arbitrary
fat and shadows. We illustrate its specializations for
memory safety next.

    fun main() {
    BB0:
      M0 = mem.init()
      R0 = nd_char()
      R1 = R0 > 0 && R0 < 10      $r_1 = r_0 > 0 \land r_0 < 10 \land$
      R2 = nd_bool()
      P1, M1 = malloc 2, M0       $p_1.base = addr_0 \land p_1.offset = 0 \land p_1.size = 4 \land m_1 = m_0 \land$
      P2, M2 = malloc 1, M0       $p_2.base = addr_0 + 4 \land p_2.offset = 0 \land p_2.size = 4 \land m_2 = m_0 \land$
      M3 = select R2, M1, M2
      P3 = select R2, P1, P2      $p_3 = ite(r_2, p_1, p_2) \land$
      R4 = isderef P3, 1          $r_4 = 0 < (1 + p_3.offset) < p_3.size \land$
      P5 = gep P3, 1              $p_5.base = p_3.base \land p_5.offset = p_3.offset + 1 \land p_5.size = p_3.size \land$
      R6 = isderef P5, 1          $r_6 = 0 < (1 + p_5.offset) < p_5.size \land$
      R7 = R0 > 0 && R0 < 10      $r_7 = r_0 > 0 \land r_0 < 10 \land$
      R8 = false                  $r_8 = false \land$
      A0 = select R7, R8, R1      $a_0 = ite(r_7, r_8, r_1) \land$
      A1 = select R6, A0, R1      $a_1 = ite(r_6, a_0, r_1) \land$
      A2 = select R4, A1, R1      $a_2 = ite(r_4, a_1, r_1) \land$
      assume (A2)                 $a_2 \land$
      assert (0)                  $\neg false$
      halt
    (a) Pure-Dataflow (PD)        (b) VC in SMT-LIB

Fig. 9: Program from Fig. 2 in PD and SMT-LIB forms. The
isderef instruction checks for spatial memory safety.

    $sym(\texttt{P1,M1 = malloc R0,M0}) \triangleq p_1 = alloc(\texttt{malloc R0,M0}) \land m_1 = alloc_{sn}(m_0, p_1)$
    $sym(\texttt{M1 = free P0,M0}) \triangleq m_1 = free_{sn}(m_0, p_0)$
    $sym(\texttt{M1 = store R1,P2,M0}) \triangleq (m_1.1,\ldots,m_1.j) = (write(m_0.1, r_1.1, addr(p_2)),\ldots,write(m_0.j, r_1.j, addr(p_2))) \land (m_1.(j+1),\ldots,m_1.k) = store_{sn}((m_0.(j+1),\ldots,m_0.k), p_2)$
    $sym(\texttt{R1 = load P0,M0}) \triangleq r_1 = (read(m_0.1, addr(p_0)),\ldots,read(m_0.j, addr(p_0)))$

Fig. 10: Memory-safety aware VCGen semantics.

    $alloc_{sn}(m, p) = (m.val, m.offset, m.size, write(m.alloc, 1, p.base), m.mod)$
    $free_{sn}(m, p) = (m.val, m.offset, m.size, write(m.alloc, 0, p.base), m.mod)$
    $store_{sn}((m.alloc, m.mod), p) = (m.alloc, write(m.mod, 1, p.base))$

Fig. 11: Shadow memory semantics for memory safety.

    $sym(\texttt{R1 = isderef P0 B}) \triangleq r_1 = 0 < p_0.offset < p_0.size$
    $sym(\texttt{R1 = isalloc P0 M}) \triangleq r_1 = read(m.alloc, p_0.base)$
    $sym(\texttt{R1 = ismod P0 M}) \triangleq r_1 = read(m.mod, p_0.base)$

Fig. 12: Semantics for verifying memory safety.

Spatial memory safety A program satisfies spatial memory 
safety iff every read and write is always inside an allocated 
object. A fat pointer is defined as a tuple of three constants
$(s_1, s_2, s_3)$ denoted as (base, offset, size) for convenience.
Here base is the start address of the object, offset is an index 
into the object, and size is its size. The address addr is given 
by base + offset. 

With fat pointers, we introduce instructions for pointer 
arithmetic and pointer integer casts. The instruction gep is 
used for pointer arithmetic. Fig. 9a shows an example use 
in P5 = gep P3, 1. Here, semantically, a new pointer P5 is
created that has the same base and size as P3, with offset
incremented by 1. We also introduce a ptoi instruction that casts
a pointer to an integer by adding offset to base. For an integer 
to pointer cast, we use the itop instruction. This instruction 
sets base to the integer value and fat (i.e., metadata) to zero. 

To assert that a pointer dereference is spatially safe, we 
provide an isderef instruction, whose semantics is shown 
in Fig. 12. For example, the program in Fig. 9a executes
assert (0) because R6 = isderef P5, 1 evaluates to false, causing
A1 and A2 to evaluate to R1 and true, respectively. Thus, the
VC in Fig. 9b is satisfiable which exposes the out of bounds 
error in Fig. 2 line 9. Note that this error is not caught by the 
VC in Sec. II. In SEABMC, we automatically add isderef 
assertions before memory accesses. Many of such assertions 
are statically and, thus, cheaply resolved to true or false prior 
to SMT solving. 
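
The fat-pointer operations described above can be sketched as follows. This is an illustration under stated assumptions, not SEABMC code: the class name FatPtr is hypothetical, and the bounds test in isderef (the access of nbytes must lie within [0, size]) is an assumed reading, since the exact inequality of Fig. 12 may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FatPtr:
    base: int     # start address of the object
    offset: int   # index into the object
    size: int     # size of the object in bytes


def addr(p):
    """Concrete address of a fat pointer: base + offset."""
    return p.base + p.offset


def gep(p, delta):
    """Pointer arithmetic: same base and size, offset moved by delta."""
    return FatPtr(p.base, p.offset + delta, p.size)


def ptoi(p):
    """Pointer-to-integer cast: add offset to base."""
    return p.base + p.offset


def itop(n):
    """Integer-to-pointer cast: base is the integer, metadata is zero."""
    return FatPtr(n, 0, 0)


def isderef(p, nbytes):
    # Assumed bounds check: the accessed bytes must stay inside the object.
    return 0 <= p.offset and p.offset + nbytes <= p.size


two_bytes = FatPtr(base=0x100, offset=0, size=2)   # models malloc 2
one_byte = FatPtr(base=0x104, offset=0, size=1)    # models malloc 1
```

Under this reading, a one-byte access through gep(one_byte, 1) is rejected while the same access through gep(two_bytes, 1) is accepted, which is the kind of out-of-bounds write that the isderef assertion in Fig. 9a is meant to expose.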

Note that SEABMC semantics for spatial safety differs from 
LLBMC [17]. LLBMC treats only accesses to unallocated 
memory as unsafe. This implies that it is valid for a pointer to 
overflow into another object allocated just below or above. 
In SEABMC, jumping across the allocated boundary is in- 
valid. SEABMC also differs from CBMC in this regard. In 
CBMC [1], the pointer representation is fixed and a few bits 
in the pointer representation are reserved for fat data. These
constrain the available address range. Additionally, only limited
metadata can be stored in each pointer. In SEABMC, we
support composite pointer representations that maintain parity
with concrete pointer representation while allowing for rich
metadata in the fat region of the pointer.

    M0 = mem.init()
    // addr=0x100
    P0, M1 = malloc 1, M0
    // addr=0x200
    P1, M2 = malloc 1, M1
    M3 = store P0, P1, M2
    halt
    (a) a program

    P1:              base   offset  size
                     0x200  0       1

    Memory at P1:    val    offset  size  alloc  mod
                     0x100  0       1     1      1
    (b) memory state

Fig. 13: Memory state M3, when P0 is stored at location P1.

Temporal memory safety A program satisfies temporal 
memory safety iff it never does one of the following: (UAF) an 
object is used after it has been freed; and (RO) an object 
marked as read-only (by programmers) is modified. We detect 
a violation of memory safety by tracking the status of a 
memory object using shadow memory. Each memory is a tuple 
$(v_1,\ldots,v_5)$ of constants of sort $bv(64) \to bv(64)$, denoted
(val, offset, size, alloc, mod), where (val, offset, size) maps 
to pointer data (base, offset, size), and alloc and mod track 
the allocated and modified status of an object, respectively. 

An object can be in allocated or freed state. To track 
allocated state, sym in Sec. II is extended for alloca, malloc, 
and free. The new semantics is shown in Fig. 10. The function 
$alloc_{sn} : mems \to mems$ is defined, for temporal memory
safety, as shown in Fig. 11. Note that $alloc_{sn}(m, r)$ marks
m.alloc memory only at the start of an object, i.e., r.base. For
this reason it is necessary to use the fat pointer representation 
since it records the base for every pointer. The isalloc 
instruction, shown in Fig. 8, is used to check the allocated 
state of an object at any point in the program. The semantics 
for isalloc is defined in Fig. 12. 

A C program has no native mechanism for verifying that 
an object remains unmodified when passed to a function. To 
remedy this, we extend the semantics for store (see Fig. 10). 
The function $store_{sn} : mems \to mems$ is implemented for
temporal memory safety (see Fig. 11). The ismod in Fig. 8 is 
used to check the read-only state of an object at any program 
point. The semantics for ismod is given in Fig. 12. We also 
provide a companion instruction resetmod R, M that resets 
m.mod at address r.base to zero. This allows initializing an 
object, resetting modified state, and then checking that the 
subsequent program does not modify the object. We track 
memory state only at object granularity, therefore, the current 
implementation is tied to using the fat pointer representation. 
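
The shadow-memory bookkeeping for temporal safety can be sketched with two maps indexed by the base address of an object. This is a simplified model for illustration only: it keeps just the alloc and mod components of the memory tuple, uses Python dictionaries in place of SMT maps, and the names Shadow, alloc_sn, free_sn, store_sn, and resetmod are assumptions modelled on Fig. 11 and the text above.

```python
from typing import NamedTuple


class Shadow(NamedTuple):
    alloc: dict   # base address -> 1 if the object is allocated, else 0
    mod: dict     # base address -> 1 if written since the last reset


def _write(table, value, key):
    """Functional update: return a copy of `table` with table[key] = value."""
    updated = dict(table)
    updated[key] = value
    return updated


def alloc_sn(m, base):
    return Shadow(_write(m.alloc, 1, base), m.mod)


def free_sn(m, base):
    return Shadow(_write(m.alloc, 0, base), m.mod)


def store_sn(m, base):
    return Shadow(m.alloc, _write(m.mod, 1, base))


def resetmod(m, base):
    return Shadow(m.alloc, _write(m.mod, 0, base))


def isalloc(m, base):
    return m.alloc.get(base, 0) == 1


def ismod(m, base):
    return m.mod.get(base, 0) == 1


m0 = Shadow({}, {})
m1 = alloc_sn(m0, 0x200)   # object at 0x200 is allocated
m2 = store_sn(m1, 0x200)   # and then written to
m3 = resetmod(m2, 0x200)   # start of a read-only region
m4 = free_sn(m3, 0x200)    # object is freed
```

A use-after-free check then amounts to asserting isalloc before an access, and a read-only check to asserting that ismod is still false after the code under test has run.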


IV. EXPERIMENTS 


In this section, we describe the evaluation of SEABMC⁵ on
verification tasks from aws-c-common. Each task verified
post-conditions and memory safety of a single function from
aws-c-common. Overall, there are 169 tasks in 20K LOC.
Results and tasks are available at
https://github.com/seahorn/verify-c-common.⁶ We have chosen these tasks because they
represent a real industrial use-case of BMC. We have adapted
them from CBMC to be compatible with LLVM-based C
verification tools. Note that here we focus on SEABMC performance.
A detailed comparison of different kinds of verification
tools on aws-c-common is presented in [14].

⁵ Source at https://github.com/seahorn/seahorn/tree/dev10.

Comparing Different VCGen Strategies We evaluate the 
effectiveness of the different VCGen strategies by controlling 
which transformations are enabled. The main performance 
metric is time solved: the time to solve all solved tasks⁷
(i.e., with timeouts excluded). The time limit is 600s per task.

First, we evaluate the two memory representations: Ar- 
rays vs Lambdas. We use Z3 [18] and YICES2 [19] to 
account for the difference between SMT-solvers. The results 
are summarized in Tab. Ia. For Z3, we find that Arrays 
are less efficient than Lambdas. For YICES2, the results are 
comparable, suggesting that the choice of the representation 
is less important. Z3 with Lambdas is the overall winner, and 
we use it for the rest of the experiments. 

Second, we evaluate the effectiveness of the transformations 
in Sec. II. The results are in Tab. Ib. Here, optimal means
applying all of the transformations involved, plus eagerly simplifying
the VC during VCGen. $\beta$-reducing lambdas introduces
many nested ITE-terms, so simplifying them early is useful.

To evaluate, we compare with 5 additional strategies, each
obtained by disabling one transformation: 1) rel_alloc: use an alloc that
returns relative addresses from some symbolic start of stack
and heap, rather than concrete addresses; 2) flat_mem: use one
flat memory instead of using alias analysis to partition memory
into disjoint memories as much as possible; 3) no_coi: disable
cone-of-influence; 4) no_simp: disable eager simplification;
5) p_cond: generate the VC directly from SSA form by using
path conditions to encode phi-functions as in [6], [20]. Removing
any of the transformations either noticeably degrades
performance, or causes a timeout.

SEABMC supports memory word sizes of 1 byte ($bv(8)$), 4
bytes ($bv(32)$) and 8 bytes ($bv(64)$). The 1-byte words are
most precise and support arbitrary memory accesses, while 8-byte
words require aligned accesses. The comparison between
the two is shown in Tab. Ic. Wider words significantly improve
performance, but can be unsound for some benchmarks. By
supporting both, SEABMC lets the user pick the most appropriate
choice per benchmark. In other experiments, we adjust word
size per individual benchmark.⁸
Shadow memory performance A C program has no built-in
mechanism for verifying that an object is not modified by a
function. To overcome this limitation, the verification tasks
in aws-c-common record the value of a byte from a non-deterministic
offset within an allocated object and then verify
that this byte is unchanged in all executions. While this is a


⁶ This website includes instructions for reproducing the experiments.
⁷ This analysis uses 172 tasks instead of 169. 3 tasks are SEABMC specific.
⁸ CBMC uses a similar per-benchmark configuration as well.

    (a) Different memory representations.
    config    solver   unsat   timeout   failed   solved time(s)
    array     z3       158     8         6        1647
    array     yices2   170     0         2        1016
    lambdas   z3       172     0         0        836
    lambdas   yices2   172     0         0        912

    (b) Different encodings.
    config      unsat   timeout   solved time(s)   avg(s)/std(s)
    optimal     172     0         836              3
    rel_alloc   172     0         1456             8
    flat_mem    163     9         2689             16
    no_coi      170     2         849              5
    no_simp     166     6         1429             9
    p_cond      170     2         659              4

    (c) Different word sizes.
    word size   unsat   timeout   failed   solved time(s)
    bv(64)      156     0         16       679
    bv(8)       171     1         0        2546

    (d) Different memory features.
    config             unsat   solved time(s)
    no shadow memory   70      143
    shadow memory      70      90

TABLE I: Evaluations of different configurations.


clever technique, setting it up in a verification task is complex. 
The ismod instruction added in SEABMC (see Sec. III) offers 
a user friendly alternative. We also found it to be more 
performant in the SEABMC implementation. We ported 70 
tasks in aws-c-common to use ismod. Ported tasks ran 55%
faster, on average, than their originals (see Tab. Id). This 
strengthens the case for shadow memory from both usability 
and performance perspectives. 

SEABMC vs. State-of-the-Art Overall, the results for our 
configurations in previous discussion suggest that the optimal 
strategy provides best performance in terms of precision and 
efficiency. We compare against four other tools:
CBMC [1], SMACK [4], KLEE [13], and SYMBIOTIC [12].
LLBMC is another interesting BMC tool; however, we decided
to exclude it from the comparison due to the lack of
an easily accessible public version⁹ for users to reproduce
LLBMC results. CBMC is, perhaps, the oldest and most well-known
BMC for C programs (not based on LLVM). It is
actively used by AWS, and was used for the verification of
aws-c-common. SMACK is an LLVM-based BMC tool that
uses Boogie [21] and Corral [4] for bounded and deductive 
verification. SYMBIOTIC is a KLEE-based tool that combines 
program instrumentation, slicing, and symbolic execution [22]. 
Both SMACK and SYMBIOTIC performed very well on the 
“SoftwareSystems” category in SV-COMP’21. KLEE is a 
LLVM-based symbolic execution tool that does not encode 
the VC in one shot but rather explores satisfiability of path 
conditions in a program one path-at-a-time. It is a practical 
alternative to BMC. 

The results collected on an AMD Ryzen(TM) 5 5600X CPU 
with 32 GB memory are shown in Tab. II. Only SEABMC 
and CBMC solve all verification tasks from aws-c-common.
SMACK in bit-precise mode times out on most instances, and 
in arithmetic mode times out on 20 and fails on 4. SYMBIOTIC 
times out on 5 and fails on 10. It is best-performing on 
priority_queue and ring_buffer. However, it also 
failed to detect seeded bugs¹⁰, which calls its results into question.
KLEE is particularly effective on linked_list, showing
the benefit of exploring one path at a time when the number of paths
is small.

Bugs found In [14], we discuss bugs found and reported to 
AWS. One example, in Fig. 14, concerns the byte buffer 
data structure that is defined as a length delimited byte string. 


⁹ LLBMC source code is not publicly available; the binary download on the
website is broken.
¹⁰ Details at https://github.com/seahorn/verify-c-common/issues/124


    1  typedef
    2  struct byte_buf {
    3    char* buf;
    4    int len, cap;
    5  } BB;
    6  bool BB_is_ok(BB *b)
    7  { return (b->len == 0
    8      || b->buf); }

Fig. 14: Incorrect byte_buf invariant


Its data representation should be either the buffer (buf) is 
NULL or its capacity (cap) is 0 (not the len as defined in 
BB_is_ok Line 7). Under the correct model (a malloc that 
can potentially fail), SEABMC produces counterexamples in
50 seconds, CBMC in 112 seconds. However, KLEE cannot 
detect this bug since it needs an allocated buffer with an 
explicit size to proceed with analysis. 

Overall, SEABMC outperforms competitors on most cate- 
gories and in the overall run-time. Thus, we conclude that 
SEABMC is a highly efficient BMC engine. 

We have compared SEABMC with tools from SV-COMP, but 
not with the benchmarks. There are two reasons. First, while 
a version of aws-c-common appears in SV-COMP, it is 
pre-processed with CBMC harnesses, and, therefore, includes 
undefined behaviors (e.g., uninitialized variables). This is not 
supported by SEABMC front-end. Second, we felt it is more 
important to validate tools in an actively developed code-base. 
Thus, we focused our effort on building an infrastructure for 
continuously verifying current aws-c-common using many
existing tools, rather than integrating SEABMC into the rules 
of SV-COMP. 


V. RELATED WORK 


Bounded Software Model Checking is a mature program 
analysis technique. We briefly review only some of the closest 
related work. Over the years, there have been many model 
checking tools built on top of the LLVM platform. The 
closest to ours is the work of Babic [23] and LLBMC 
[17]. Similarly to [23], we rely on the Gated SSA form to 
remove all control dependence leaving only data-flows to be 
represented. However, our encoding is significantly simplified 
by an intermediate representation that purifies memory flows. 
Unfortunately, [23] has not been maintained making head-to- 
head comparison difficult. 

We borrow the idea of using lambda-encoding for repre- 
senting memory from LLBMC [17]. One important advantage 
of lambdas is that we can represent memory operations such 
as memcpy efficiently (while with arrays, these have to be

                   Statistics   | SEABMC          | CBMC              | SMACK                          | SYMBIOTIC                         | KLEE
    category       cnt  loc     | avg  std  time  | avg  std   time   | cnt  fld/to  avg  std  time    | cnt  fld/to  avg  std    time     | cnt  avg  std  time
    arithmetic     6    202     | 1    0    3     | 4    0     22     | 6    2/0     3    1    18      | 6    0/0     135  281    809      | 6    1    0    5
    array          4    390     | 2    1    7     | 6    0     23     | 4    0/1     53   98   213     | 4    0/0     11   4      44       | 4    26   2    103
    array_list     24   3,150   | 3    4    71    | 19   33    450    | 24   0/0     5    1    126     | 23   0/0     43   68     980      | 24   41   38   994
    byte_buf       29   2,908   | 1    1    29    | 9    10    252    | 29   0/2     27   50   788     | 29   0/0     40   162    1,168    | 27   59   96   1,592
    byte_cursor    24   2,365   | 1    0    23    | 6    3     153    | 16   0/2     32   66   519     | 17   0/0     7    4      125      | 17   10   11   169
    hash_callback  3    347     | 6    2    18    | 8    5     25     | 3    0/0     4    2    11      | 3    0/0     40   62     120      | 3    50   38   151
    hash_iter      4    708     | 9    15   37    | 10   6     39     | 4    0/0     91   58   363     | 3    0/1     37   44     112      | 3    14   6    41
    hash_table     19   3,295   | 6    8    105   | 19   28    366    | 19   2/4     54   79   1,025   | 15   8/4     472  1,261  7,088    | 15   33   72   492
    linked_list    18   2,127   | 2    2    37    | 33   112   595    | 18   0/5     96   91   1,735   | 18   0/0     8    5      143      | 18   1    0    12
    others         2    31      | 0    0    1     | 4    0     7      | 1    0/0     2    0    2       | 1    0/0     5    0      5        | 1    1    0    1
    priority_queue 15   3,004   | 14   22   202   | 286  700   4,284  | 15   0/1     20   50   307     | 15   0/0     10   20     152      | 15   32   8    473
    ring_buffer    6    934     | 21   22   128   | 13   8     78     | 6    1/0...  


unfolded). In particular, this allows for unbounded verification 
of loop-free programs that use these operations. The most 
significant difference from LLBMC is in our encoding of 
memory safety. In particular, we cache bounds information 
in the pointer, and check that every access is inside the 
allocated memory object. In contrast, LLBMC assumes an 
arbitrary allocator and checks that all accesses are into some 
allocated memory, not necessarily into the expected object. 
Unfortunately, there is no public version of LLBMC available, 
precluding a head-to-head comparison. 

SMACK [4], [5] is probably the most known BMC for 
LLVM. It is based on Boogie and Corral from Microsoft 
Research. It is most effective for arithmetic abstraction of soft- 
ware (i.e., abstracting machine integers by arbitrary precision 
integers). Its model for memory safety relies on complex en- 
coding using universally quantified axioms in Boogie, leading 
to quantified reasoning in SMT. In contrast, our representation 
is tuned to perform well with modern SMT solvers. SMACK 
shares SEADSA [24], [25] alias analysis with SEABMc. DI- 
VINE4 [26] is an explicit state model checker that also targets 
LLVM. However, it uses LLVM 7 which makes head-to- 
head comparison difficult. It targets parallel programs, which 
SEABMC does not. For sequential programs, it is related to 
libFuzzer and KLEE that we compare with. 


VI. CONCLUSION 


We have presented the techniques behind SEABMC, a new 
LLV M-base Bounded Model Checker for C. SEABMC is path- 
sensitive, bit-precise, and provides a precise model of memory. 
It extends the traditional memory model with fat pointers 
and shadow memory that allow attaching metadata to pointers 
and memory. We have evaluated SEABMC against CBMC, 
SMACK, SYMBIOTIC, and KLEE and show significant per- 
formance improvements over the competition. 
flag                 meaning
--unwind 1           number of times to unwind loops
--flush              print to stdout
--object-bits 8      number of pointer bits to store meta information
--malloc-may-fail    malloc may fail
--malloc-fail-null   malloc may fail and return NULL


TABLE V: CBMC options for no-mem-safe.


flag                 meaning
--unwind 1           number of times to unwind loops
--flush              print to stdout
--object-bits 8      number of pointer bits to store meta information
--malloc-may-fail    malloc may fail
--malloc-fail-null   malloc may fail and return NULL


TABLE VI: CBMC options for no-memmove.


verification task      config        tool     run-time (s)
aws-array-list-erase   all           SEABMC   4
aws-array-list-erase   all           CBMC     98
aws-array-list-erase   no-mem-safe   SEABMC   2
aws-array-list-erase   no-mem-safe   CBMC     98
aws-array-list-erase   no-memmove    SEABMC   2
aws-array-list-erase   no-memmove    CBMC     40


TABLE III: SEABMC vs. CBMC for aws-array-list-erase.


flag                 meaning
--unwind 1           number of times to unwind loops
--flush              print to stdout
--object-bits 8      number of pointer bits to store meta information
--malloc-may-fail    malloc may fail
--malloc-fail-null   malloc may fail and return NULL
--bounds_check       check access is within bounds
--pointer_check      check access is within bounds


TABLE IV: CBMC options for all.


APPENDIX 


Performance of SEABMC vs. CBMC. In this section we
look at the performance of SEABMC vs. CBMC more closely.
In App. A we study tool performance on a single task by
using different features of the tools. In App. B, we look at the
CBMC flags used for the analysis.


A. Comprehensive Analysis w.r.t. CBMC


SEABMC outperforms CBMC on many of the categories.
To ensure that the comparison is “fair”, we have done a
comprehensive manual analysis with a few verification tasks.

For a fair comparison, one must show that the verification 
problem being solved is the same. While both tools verify 
user-supplied assertions in aws-c-common, they also verify
internal properties such as memory safety, integer overflow, 
etc., depending on how they are invoked. For example, CBMC 
checks for integer overflow, while SEABMC does not. Hence, 
as a first step, we identified all such options in CBMC and 
disabled them. 

There are many other factors that differentiate SEABMC
and CBMC including: IRs (i.e., GOTO program vs. LLVM-IR),
model of memory operations, and VCGen. Thus, we
identified the differences that benefit SEABMC. We chose
one verification task, aws-array-list-erase, and derived 3
configurations based on the above analysis¹¹: 1) All: SEABMC
and CBMC verify a similar set of properties, namely,
user-supplied assertions and memory safety. 2) No Memory Safety:
SEABMC and CBMC verify user-supplied assertions only.
3) No memmove: aws-array-list-erase uses memmove in its
implementation. Since memmove has custom implementations
in both SEABMC and CBMC, we evaluated run-time when
disabling the assertions for it.¹²

The results are shown in Tab. III. We present the analysis
for one verification task; however, the same applies to other
verification tasks where SEABMC outperforms CBMC, even
when verifying similar properties. Further manual analysis
shows that most of the difference is due to the model of memory
in SEABMC and CBMC. Specifically, memory operations on
large blocks are very expensive for CBMC (40s vs. 98s due
to pre-conditions for memmove in Tab. III).

¹¹See App. B for CBMC flags used.
¹²These assertions guarantee spatial memory safety of memmove.


B. Command line options for CBMC 


This section lists the CBMC command line flags used for
the aws-array-list-erase verification job in the different
configurations.
all          Options to enable user assertions and memory safety checks.
no-mem-safe  Options to enable user assertions only.
no-memmove   Options to enable user assertions and remove
             memory safety and memmove checks.

The memmove checks are disabled manually in source code.
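
The three configurations can also be summarized programmatically. The following is a minimal Python sketch that only assembles argument lists from the flags printed in Tables IV, V, and VI. The helper name `cbmc_command` and the harness file name are hypothetical, and the flag spellings are copied from the tables rather than checked against a particular CBMC release.

```python
# Flags shared by all three configurations (Tables IV-VI).
BASE_FLAGS = [
    "--unwind", "1",
    "--flush",
    "--object-bits", "8",
    "--malloc-may-fail",
    "--malloc-fail-null",
]

# Extra flags that appear only in Table IV (configuration "all").
MEMORY_SAFETY_FLAGS = ["--bounds_check", "--pointer_check"]

CONFIGS = {
    "all": BASE_FLAGS + MEMORY_SAFETY_FLAGS,
    "no-mem-safe": list(BASE_FLAGS),
    # Same flags as no-mem-safe; the memmove checks are removed in the source code.
    "no-memmove": list(BASE_FLAGS),
}


def cbmc_command(config, harness="harness.c"):
    """Return an argument list for one configuration (harness name is a placeholder)."""
    return ["cbmc", *CONFIGS[config], harness]


if __name__ == "__main__":
    for name in CONFIGS:
        print(name, " ".join(cbmc_command(name)))
```

Keeping the shared flags in one list makes it explicit that no-mem-safe and no-memmove differ only in the source-level change, not on the command line.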
Automatic Repair and Deadlock Detection
for Parameterized Systems
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Abstract—We present an algorithm for the repair of param- 
eterized systems. The repair problem is, for a given process 
implementation, to find a refinement such that a given safety 
property is satisfied by the resulting parameterized system, and 
deadlocks are avoided. Our algorithm uses a parameterized 
model checker to determine the correctness of candidate solutions 
and employs a constraint system to rule out candidates. We
apply this algorithm to systems that can be represented as
well-structured transition systems (WSTS), including disjunctive 
systems, pairwise rendezvous systems, and broadcast protocols. 
Moreover, we show that parameterized deadlock detection can be 
decided in EXPTIME for disjunctive systems, and that deadlock 
detection is in general undecidable for broadcast protocols. 


I. INTRODUCTION 


Concurrent systems are hard to get correct, and are therefore 
a promising application area for formal methods. For systems 
that are composed of an arbitrary number of processes n, 
methods such as parameterized model checking can provide 
correctness guarantees that hold regardless of n. While the pa- 
rameterized model checking problem (PMCP) is undecidable 
even if we restrict systems to uniform finite-state processes [1], 
there exist several approaches that decide the problem for 
specific classes of systems and properties [2]-[10]. 

However, if parameterized model checking detects a fault in 
a given system, it does not tell us how to repair the latter such 
that it satisfies the specification. To repair the system, the user 
has to find out which behavior of the system causes the fault, 
and how it can be corrected. Both tasks may be nontrivial. 

For faults in the internal behavior of a process, the approach 
we propose is based on a similar idea as existing repair 
approaches [11], [12]: we start with a non-deterministic im- 
plementation, and restrict non-determinism to obtain a correct 
implementation. This non-determinism may have been added 
by a designer to “propose” possible repairs for a system that 
is known or suspected to be faulty. 

However, repairing a process internally will not be enough 
in the presence of concurrency. We need to go beyond existing 
repair approaches, and also repair the communication between 
processes to ensure the large number of possible interactions 
between processes is correct as well. We do so by choosing the 
right options out of a set of possible interactions, combining 
the idea above with that of synchronization synthesis [13], 
[14]. 

In addition to guaranteeing safety properties, we aim for 
an approach that avoids introducing deadlocks, which is
particularly important for a repair algorithm, since often the
easiest way to “repair” a system is to let it run into a
deadlock as quickly as possible. Unlike non-determinism for 
repairing internal behavior, we are even able to introduce non- 
determinism for repairing communication automatically. 
Regardless of whether faults are fixed in the internal behav- 
ior or in the communication of processes, we aim for a parame- 
terized correctness guarantee, i.e., the repaired implementation 
should be correct in a system with any number of processes. 
We show how to achieve this by integrating techniques from 
parameterized model checking into our repair approach. 


High-Level Parameterized Repair Algorithm. Figure 1 
sketches the basic idea of our parameterized repair algorithm. 


[Figure 1: flowchart of the repair loop. Recoverable labels: "No: error sequence E", "Refine constraints", "No", "Yes: $\delta'$", "Unrealizable", "Restrict M with $\delta'$".]


Fig. 1: Parameterized repair of concurrent systems


The algorithm starts with a representation M of the pa- 
rameterized system, based on non-deterministic models of 
the components, and checks if error states are reachable for 
any size of M. If not, the components are already correct. 
Otherwise, the parameterized model checker returns an error 
sequence E, i.e., one or more concrete error paths. E is then 
encoded into constraints that ensure that any component that 
satisfies them will avoid the error paths detected so far. A SAT 
solver is used to find out if any solution still exists, and if 
so we restrict M to components that avoid previously found 
errors. To guarantee that this restriction does not introduce 
deadlocks, the next step is a parameterized deadlock detection. 
This provides similar information as the model checker, and 
is used to refine the constraints if deadlocks are reachable. 
Otherwise, $M'$ is sent to the parameterized model checker for
the next iteration. 


Research Challenges. Parameterized model checking in
general is known to be undecidable, but different decision
procedures exist for certain classes of systems, such as guarded
protocols with disjunctive guards (or disjunctive systems) [4], 
pairwise systems [2] and broadcast protocols [3]. However, 
these theoretical solutions are not uniform and do not provide 
practical algorithms that allow us to extract the information 
needed for our repair approach. Therefore, the following 
challenges need to be overcome to obtain an effective param- 
eterized repair algorithm for a broad class of systems: 

C1 The parameterized model checking algorithm should be
   uniform, and needs to provide information about error
   paths in the current candidate model that allows us to avoid
   such error paths in future repair candidates.

C2 We need an effective approach for parameterized
   deadlock detection, preferably supplying similar information
   as the model checker.

C3 We need to identify an encoding of the discovered
   information into constraints such that the repair process is
   sufficiently flexible¹, and sufficiently efficient to handle
   examples of interesting size.


Parameterized Repair: an Example. Consider a system with 
one scheduler (Fig. 2) and an arbitrary number of reader-writer 
processes (Fig. 3), running concurrently and communicating 
via pairwise rendezvous, i.e., every send action (e.g., write!)
needs to synchronize with a receive action (e.g. write?) by 
another process. In this system, multiple processes can be in 
the writing state at the same time, which must be avoided if 
they use a shared resource. We want to repair the system by 
restricting communication of the scheduler. 

According to the idea in Fig. 1, the parameterized model 
checker searches for reachable errors, and it may find that 
after two consecutive write! transitions by different reader- 
writer processes, they both occupy the writing state at the 
same time. This information is then encoded into constraints 
on the behavior of processes, which restrict non-determinism 
and communication and make the given error path impossible. 
To repair the system, we essentially need to force a
process to wait for the action done (done writing) before
entering the writing state. However, in our example all errors
could be avoided by simply removing all outgoing transitions 
of state q4, of the scheduler. To avoid such repairs, our 
algorithm uses initial constraints (see section IV) that enforce 
totality on the transition relation. Another undesirable solution 
would be the scheduler shown in Fig. 4, because the resulting 
system will deadlock immediately. This is avoided by checking 
reachability of deadlocks on candidate repairs. We get a 
solution that is safe and deadlock-free if we take Fig. 4 and 
flip all transitions. 


Contributions. Our main contribution is a
counterexample-guided parameterized repair approach, based on model
checking of well-structured transition systems (WSTS) [15], [16].
We investigate which information a parameterized model
checker needs to provide to guide the search for candidate
repairs, and how this information can be encoded into
propositional constraints. Our repair algorithm supports internal
repairs and repairs of the communication behavior, while
systematically avoiding deadlocks in many classes of systems,
including disjunctive systems, pairwise systems and broadcast
protocols.

¹For example, to allow the user to specify additional properties of the repair,
such as keeping certain states reachable.

Since existing model checking algorithms for WSTS do not 
support deadlock detection, our approach has a subprocedure 
for this problem, which relies on new theoretical results: (i) 
for disjunctive systems, we provide a novel deadlock detec- 
tion algorithm, based on an abstract transition system, that 
improves on the complexity of the best known solution; (ii) 
for broadcast protocols we prove that deadlock detection is in 
general undecidable, so approximate methods have to be used. 
We also discuss approximate methods to detect deadlocks in 
pairwise systems, which can be used as an alternative to the 
existing approach that has a prohibitive complexity. 

Finally, we evaluate an implementation of our algorithm on 
benchmarks from different application domains, including a 
distributed lock service and a robot-flocking protocol. 


II. SYSTEM MODEL 


For simplicity, we first restrict our attention to disjunctive
systems; other systems will be considered in Sect. V-B. In the
following, let $Q$ be a finite set of states.


Processes. A process template is a transition system
$U = (Q_U, \mathit{init}_U, G_U, \delta_U)$, where $Q_U \subseteq Q$ is a finite set of states
including the initial state $\mathit{init}_U$, $G_U \subseteq \mathcal{P}(Q)$ is a set of guards,
and $\delta_U : Q_U \times G_U \times Q_U$ is a guarded transition relation.

We denote by $t_U$ a transition of $U$, i.e., $t_U \in \delta_U$, and by
$\delta_U(q_U)$ the set of all outgoing transitions of $q_U \in Q_U$. We
assume that $\delta_U$ is total, i.e., for every $q_U \in Q_U$, $\delta_U(q_U) \neq \emptyset$.
Define the size of $U$ as $|U| = |Q_U|$. An instance of template
$U$ will be called a $U$-process.


Disjunctive Systems. Fix process templates $A$ and $B$ with
$Q = Q_A \mathbin{\dot\cup} Q_B$, and let $G = G_A \cup G_B$ and $\delta = \delta_A \cup \delta_B$. We
consider systems $A \| B^n$, consisting of one $A$-process and $n$
$B$-processes in an interleaving parallel composition.²

The systems we consider are called “disjunctive” since 
guards are interpreted disjunctively, i.e., a transition with 
a guard g is enabled if there exists another process that 
is currently in one of the states in g. Figures 5 and 6 
give examples of process templates. 

An example disjunctive system is $A \| B^n$, where $A$ is the
writer and $B$ the reader, and the guards determine which
transition can be taken by a process, depending on its own state
and the state of other processes in the system. Transitions with
the trivial guard $g = Q$ are displayed without a guard. We
formalize the semantics of disjunctive systems in the following.

[Figures 5 and 6: state diagrams of the two process templates; only the guard label {nw} is recoverable.]

Fig. 5: Writer        Fig. 6: Reader


Counter System. A configuration of a system $A \| B^n$ is a tuple
$(q_A, c)$, where $q_A \in Q_A$, and $c : Q_B \to \mathbb{N}_0$. We identify $c$
with the vector $(c(q_0), \ldots, c(q_{|B|-1})) \in \mathbb{N}_0^{|B|}$, and also use
$c(i)$ to refer to $c(q_i)$. Intuitively, $c(i)$ indicates how many
processes are in state $q_i$. We denote by $u_i$ the unit vector with
$u_i(i) = 1$ and $u_i(j) = 0$ for $j \neq i$.

Given a configuration $s = (q_A, c)$, we say that the guard $g$
of a local transition $(q_U, g, q'_U) \in \delta_U$ is satisfied in $s$, denoted
$s \models_{q_U} g$, if one of the following conditions holds:

(a) $q_U = q_A$, and $\exists q_i \in Q_B$ with $q_i \in g$ and $c(i) \geq 1$
    ($A$ takes the transition, a $B$-process is in $g$)
(b) $q_U \neq q_A$, $c(q_U) \geq 1$, and $q_A \in g$
    ($B$-process takes the transition, $A$ is in $g$)
(c) $q_U \neq q_A$, $c(q_U) \geq 1$, and $\exists q_i \in Q_B$ with $q_i \in g$, $q_i \neq q_U$
    and $c(i) \geq 1$
    ($B$-process takes the transition, another $B$-process is in a
    different state in $g$)
(d) $q_U \neq q_A$, $q_U \in g$, and $c(q_U) \geq 2$
    ($B$-process takes the transition, another $B$-process is in the
    same state in $g$)

We say that the local transition $(q_U, g, q'_U)$ is enabled in $s$.

Then the configuration space of all systems $A \| B^n$, for
fixed $A, B$ but arbitrary $n \in \mathbb{N}$, is the transition system
$M = (S, S_0, \Delta)$ where:

• $S \subseteq Q_A \times \mathbb{N}_0^{|B|}$ is the set of states,
• $S_0 = \{(\mathit{init}_A, c) \mid c(q) = 0 \text{ if } q \neq \mathit{init}_B\}$ is the set of
  initial states,
• $\Delta$ is the set of transitions $((q_A, c), (q'_A, c'))$ s.t. one of
  the following holds:
  1) $c = c' \wedge \exists (q_A, g, q'_A) \in \delta_A : (q_A, c) \models_{q_A} g$
     (transition of $A$)
  2) $q_A = q'_A \wedge \exists (q_i, g, q_j) \in \delta_B : c(i) \geq 1 \wedge c' = c - u_i + u_j \wedge (q_A, c) \models_{q_i} g$
     (transition of a $B$-process)

We will also call $M$ the counter system (of $A$ and $B$), and
will call configurations states of $M$, or global states.


²The form $A \| B^n$ is only assumed for simplicity of presentation. Our results
extend to systems with an arbitrary number of process templates.


Let $s, s' \in S$ be states of $M$, and $U \in \{A, B\}$. For a
transition $(s, s') \in \Delta$ we also write $s \to s'$. If the transition is
based on the local transition $t_U = (q_U, g, q'_U) \in \delta_U$, we also
write $s \xrightarrow{t_U} s'$ or $s \xrightarrow{g} s'$. Let $\Delta^{local}(s) = \{t_U \mid s \xrightarrow{t_U} s'\}$,
i.e., the set of all enabled outgoing local transitions from $s$,
and let $\Delta(s, t_U) = s'$ if $s \xrightarrow{t_U} s'$. From now on we assume
wlog. that each guard $g \in G$ is a singleton.³
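
To make the semantics concrete, the following Python sketch evaluates conditions (a) to (d) and enumerates the successors of a configuration. It is an illustration under stated assumptions, not the authors' implementation: the function names, the tuple encoding of configurations, and the small reader-writer templates at the bottom are hypothetical (the actual templates of Figs. 5 and 6 are not reproduced here).

```python
# A configuration is (q_a, c): the local state of the A-process and a tuple of
# counters, one per B-state. `idx` maps each B-state to its counter position.
# A local transition is a triple (source, guard, target) with guard a frozenset.

def guard_satisfied(q_a, c, idx, q_u, g):
    """Conditions (a)-(d): can a process in local state q_u pass guard g?"""
    def count(q):
        return c[idx[q]] if q in idx else 0

    if q_u == q_a:                                    # (a) A moves
        return any(count(q) >= 1 for q in g)
    if count(q_u) < 1:                                # no B-process in q_u
        return False
    if q_a in g:                                      # (b) A is in the guard
        return True
    if any(q != q_u and count(q) >= 1 for q in g):    # (c) other B-state in g
        return True
    return q_u in g and count(q_u) >= 2               # (d) same B-state in g


def successors(q_a, c, idx, delta_a, delta_b):
    """All configurations reachable in one step of the counter system."""
    result = []
    for src, g, dst in delta_a:
        if src == q_a and guard_satisfied(q_a, c, idx, src, g):
            result.append((dst, c))                   # counters unchanged
    for src, g, dst in delta_b:
        if guard_satisfied(q_a, c, idx, src, g):
            d = list(c)
            d[idx[src]] -= 1                          # c' = c - u_i + u_j
            d[idx[dst]] += 1
            result.append((q_a, tuple(d)))
    return result


# Hypothetical templates: writer states nw/w, reader states nr/r.
ALL = frozenset({"nw", "w", "nr", "r"})               # trivial guard g = Q
IDX = {"nr": 0, "r": 1}
DELTA_A = [("nw", ALL, "w"), ("w", ALL, "nw")]
DELTA_B = [("nr", frozenset({"nw"}), "r"), ("r", ALL, "nr")]

if __name__ == "__main__":
    print(successors("nw", (1, 0), IDX, DELTA_A, DELTA_B))
```

Guards are kept as sets here, so the singleton-guard assumption is not needed for this sketch.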


Runs. A path of a counter system is a (finite or infinite)
sequence of states $x = s_1, s_2, \ldots$ such that $s_m \to s_{m+1}$ for
all $m \in \mathbb{N}$ with $m < |x|$ if the path is finite. A maximal path
is a path that cannot be extended, and a run is a maximal path
starting in an initial state. We say that a run is deadlocked if it
is finite. Note that every run $s_1, s_2, \ldots$ of the counter system
corresponds to a run of a fixed system $A \| B^n$, i.e., the number
of processes does not change during a run. Given a set of error
states $E \subseteq S$, an error path is a finite path that starts in an
initial state and ends in $E$.


The Parameterized Repair Problem. Let $M = (S, S_0, \Delta)$
be the counter system for process templates
$A = (Q_A, \mathit{init}_A, G_A, \delta_A)$, $B = (Q_B, \mathit{init}_B, G_B, \delta_B)$, and
$ERR \subseteq Q_A \times \mathbb{N}_0^{|B|}$ a set of error states. The parameterized
repair problem is to decide if there exist process templates
$A' = (Q_A, \mathit{init}_A, G_A, \delta'_A)$ with $\delta'_A \subseteq \delta_A$ and
$B' = (Q_B, \mathit{init}_B, G_B, \delta'_B)$ with $\delta'_B \subseteq \delta_B$ such that the counter
system $M'$ for $A'$ and $B'$ does not reach any state in $ERR$.

If they exist, we call $\delta' = \delta'_A \cup \delta'_B$ a repair for $A$ and
$B$. We call $M'$ the restriction of $M$ to $\delta'$, also denoted
$Restrict(M, \delta')$.

Note that by our assumption that the local transition rela- 
tions are total, a trivial repair that disables all transitions from 
some state is not allowed. 


III. PARAMETERIZED MODEL CHECKING OF DISJUNCTIVE 
SYSTEMS 


In this section, we address research challenges C1 and 
C2: after establishing that counter systems can be framed 
as well-structured transition systems (WSTS) (Sect. III-A), 
we introduce a parameterized model checking algorithm for 
disjunctive systems that suits our needs (Sect. III-B), and 
finally show how the algorithm can be modified to also check
for the reachability of deadlocked states (Sect. III-C). Full
proofs for the lemmas in this section can be found in the 
extended version [17]. 


A. Counter Systems as WSTS 


Well-quasi-order. Given a set of states $S$, a binary relation
$\preceq \subseteq S \times S$ is a well-quasi-order (wqo) if $\preceq$ is reflexive, transitive,
and if any infinite sequence $s_0, s_1, \ldots \in S^{\omega}$ contains a pair
$s_i \preceq s_j$ with $i < j$. A subset $R \subseteq S$ is an antichain if any two
distinct elements of $R$ are incomparable wrt. $\preceq$. Therefore, $\preceq$
is a wqo on $S$ if and only if it is well-founded and has no
infinite antichains.

³This is not a restriction as any local transition $(q_U, g, q'_U)$ with
a guard $g \in G$ and $|g| > 1$ can be split into $|g|$ transitions
$(q_U, g_1, q'_U), \ldots, (q_U, g_{|g|}, q'_U)$ where for all $i \leq |g| : g_i \in g$ is a singleton
guard.


Upward-closed Sets. Let $\preceq$ be a wqo on $S$. The upward
closure of a set $R \subseteq S$, denoted $\uparrow R$, is the set
$\{s \in S \mid \exists s' \in R : s' \preceq s\}$. We say that $R$ is upward-closed if $\uparrow R = R$. If $R$
is upward-closed, then we call $B \subseteq S$ a basis of $R$ if $\uparrow B = R$.
If $\preceq$ is also antisymmetric, then any basis of $R$ has a unique
subset of minimal elements. We call this set the minimal basis
of $R$, denoted $minBasis(R)$.


Compatibility. Given a counter system $M = (S, S_0, \Delta)$, we
say that a wqo $\preceq \subseteq S \times S$ is compatible with $\Delta$ if the following
holds: $\forall s, s', r \in S$: if $s \to s'$ and $s \preceq r$ then $\exists r'$ with
$s' \preceq r'$ and $r \to^{*} r'$. We say $\preceq$ is strongly compatible with $\Delta$ if
the above holds with $r \to r'$ instead of $r \to^{*} r'$.


WSTS [15]. We say that $(M, \preceq)$ with $M = (S, S_0, \Delta)$ is a
well-structured transition system if $\preceq$ is a wqo on $S$ that is
compatible with $\Delta$.

Lemma 1: Let $M = (S, S_0, \Delta)$ be a counter system for
process templates $A, B$, and let $\lesssim \subseteq S \times S$ be the binary
relation defined by:

$(q_A, c) \lesssim (q'_A, d) \Leftrightarrow (q_A = q'_A \wedge c \lesssim d)$,

where $\lesssim$ on vectors is the component-wise ordering. Then
$(M, \lesssim)$ is a WSTS.
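
For the order of Lemma 1, upward closures and minimal bases are straightforward to compute on finite sets of configurations. The following Python sketch shows one possible encoding; the function names and the representation of a configuration as a pair of a local state and a counter tuple are assumptions made for illustration.

```python
def leq(s, r):
    """(q_A, c) <= (q'_A, d) iff q_A == q'_A and c <= d component-wise."""
    (q_s, c), (q_r, d) = s, r
    return q_s == q_r and all(x <= y for x, y in zip(c, d))


def min_basis(states):
    """Keep only the minimal elements of a finite set of configurations."""
    states = set(states)
    return {s for s in states
            if not any(t != s and leq(t, s) for t in states)}


def in_upward_closure(s, basis):
    """Membership test for the upward closure represented by `basis`."""
    return any(leq(b, s) for b in basis)


if __name__ == "__main__":
    basis = min_basis({("w", (0, 1)), ("w", (1, 1)), ("nw", (0, 1))})
    print(sorted(basis))
    print(in_upward_closure(("w", (3, 2)), basis))
```

Because the order is antisymmetric, the minimal basis is unique, so the result of `min_basis` does not depend on iteration order.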


Predecessor, Effective pred-basis [16]. Let $M = (S, S_0, \Delta)$
be a counter system and let $R \subseteq S$. Then the set of immediate
predecessors of $R$ is

$pred(R) = \{s \in S \mid \exists r \in R : s \to r\}$.

A WSTS $(M, \lesssim)$ has effective pred-basis if there exists an
algorithm that takes as input any finite set $R \subseteq S$ and returns
a finite basis of $\uparrow pred(\uparrow R)$. Note that, since $\lesssim$ is strongly
compatible with $\Delta$, if a set $R \subseteq S$ is upward-closed with
respect to $\lesssim$ then $pred(R)$ is also upward-closed.⁴

For backward reachability analysis, we want to compute
$pred^{*}(R)$ as the limit of the sequence $R_0 \subseteq R_1 \subseteq \ldots$ where
$R_0 = R$ and $R_{i+1} = R_i \cup pred(R_i)$. Note that if we have
strong compatibility and effective pred-basis, we can compute 
pred*(R) for any upward-closed set R. If we can furthermore 
check intersection of upward-closed sets with initial states 
(which is easy for counter systems), then reachability of 
arbitrary upward-closed sets is decidable. 

The following lemma, like Lemma 1, can be considered 
folklore. We present it here mainly to show how we can 
effectively compute the predecessors, which is an important 
ingredient of our model checking algorithm. 

Lemma 2: Let $M = (S, S_0, \Delta)$ be a counter system for
guarded process templates $A, B$. Then $(M, \lesssim)$ has effective
pred-basis.


⁴For a formal proof, check the extended version [17].


B. Model Checking Algorithm 


Our model checking algorithm is based on the known back- 
wards reachability algorithm for WSTS [15]. We present it in 
detail to show how it stores intermediate results to return an 
error sequence, from which we derive concrete error paths. 


Algorithm 1 Parameterized Model Checking

 1: procedure MODELCHECK(Counter System $M$, $ERR$)
 2:     $tempSet \leftarrow ERR$, $E_0 \leftarrow ERR$, $i \leftarrow 1$, $visited \leftarrow \emptyset$
        // A fixed point is reached if visited = tempSet
 3:     while $tempSet \neq visited$ do
 4:         $visited \leftarrow tempSet$
 5:         $E_i \leftarrow minBasis(pred(\uparrow E_{i-1}))$
 6:         // pred is computed as in the proof of Lemma 2
 7:         if $E_i \cap S_0 \neq \emptyset$ then   // intersect with initial states?
 8:             return False, $\{E_0, \ldots, E_i \cap S_0\}$
 9:         $tempSet \leftarrow minBasis(visited \cup E_i)$
10:         $i \leftarrow i + 1$
11:     return True, $\emptyset$


Given a counter system $M$ and a finite basis $ERR$ of the
set of error states, Algorithm 1 iteratively computes the set
of predecessors until it reaches an initial state, or a fixed
point. The procedure returns either True, i.e., the system is
safe, or an error sequence $E_0, \ldots, E_k$, where $E_0 = ERR$,
$\forall 0 < i < k : E_i = minBasis(\uparrow pred(\uparrow E_{i-1}))$, and
$E_k = minBasis(\uparrow pred(\uparrow E_{k-1})) \cap S_0$. That is, every $E_i$ is
the minimal basis of the states that can reach $ERR$ in $i$ steps.
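
The backward search can be sketched in Python as follows. This is a hedged illustration, not the authors' code: the predecessor computation `pred_basis` is our own construction for disjunctive guards (the construction from the proof of Lemma 2 is only given in the extended version), and the reader-writer templates are hypothetical stand-ins for Figs. 5 and 6.

```python
def leq(s, r):
    return s[0] == r[0] and all(x <= y for x, y in zip(s[1], r[1]))

def min_basis(states):
    states = set(states)
    return {s for s in states if not any(t != s and leq(t, s) for t in states)}

def pred_basis(basis, idx, delta_a, delta_b):
    """Finite basis of the predecessors of the upward closure of `basis`
    (assumed construction: least configuration per transition and guard state)."""
    out = set()
    for qa, d in basis:
        for src, g, dst in delta_a:              # A moves, counters unchanged
            if dst != qa:
                continue
            for gq in g:
                if gq in idx:                    # some B-process must be in gq
                    c = list(d)
                    c[idx[gq]] = max(c[idx[gq]], 1)
                    out.add((src, tuple(c)))
        for src, g, dst in delta_b:              # one B-process moves i -> j
            i, j = idx[src], idx[dst]
            c = list(d)
            if i != j:
                c[j] = max(c[j] - 1, 0)
                c[i] += 1
            else:
                c[i] = max(c[i], 1)
            for gq in g:
                e = list(c)
                if gq in idx:                    # another B-process in gq
                    k = idx[gq]
                    e[k] = max(e[k], 2 if k == i else 1)
                elif gq != qa:                   # guard names a different A-state
                    continue
                out.add((qa, tuple(e)))
    return out

def model_check(err, is_initial, pred):
    """Returns (True, []) if safe, else (False, [E_0, ..., E_k])."""
    layers = [min_basis(err)]
    visited = set(layers[0])
    while True:
        layer = min_basis(pred(layers[-1]))
        hits = {s for s in layer if is_initial(s)}
        if hits:
            return False, layers + [hits]
        merged = min_basis(visited | layer)
        if merged == visited:                    # fixed point reached
            return True, []
        visited = merged
        layers.append(layer)

# Hypothetical templates: writer states nw/w, reader states nr/r.
ALL = frozenset({"nw", "w", "nr", "r"})
IDX = {"nr": 0, "r": 1}
DELTA_A = [("nw", ALL, "w"), ("w", ALL, "nw")]
DELTA_B = [("nr", frozenset({"nw"}), "r"), ("r", ALL, "nr")]

if __name__ == "__main__":
    safe, seq = model_check(
        {("w", (0, 1))},
        lambda s: s[0] == "nw" and s[1][1] == 0,
        lambda b: pred_basis(b, IDX, DELTA_A, DELTA_B),
    )
    print(safe, seq)
```

The loop stops either when a basis element is itself an initial configuration or when a new layer adds nothing to the basis of visited states, mirroring the two exits of Algorithm 1.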


Properties of Algorithm 1. Correctness of the algorithm 
follows from the correctness of the algorithm by Abdulla 
et al. [15], and from Lemma 2. Termination follows from 
the fact that a non-terminating run would produce an infinite 
minimal basis, which is impossible since a minimal basis is 
an antichain. 


Example. Consider the reader-writer system in Figures 5 and
6. Suppose the error states are all states where the writer is in $w$
while a reader is in $r$. In other words, the error set of the
corresponding counter system $M$ is $\uparrow E_0$ where $E_0 = \{(w, (0, 1))\}$
and $(0, 1)$ means zero reader-processes are in $nr$ and one in $r$.
Note that $\uparrow E_0 = \{(w, (i_0, i_1)) \mid (w, (0, 1)) \lesssim (w, (i_0, i_1))\}$,
i.e., all elements with the same $w$, $i_0 \geq 0$ and $i_1 \geq 1$. If
we run Algorithm 1 with the parameters $M$, $\{(w, (0, 1))\}$
we get the following error sequence: $E_0 = \{(w, (0, 1))\}$,
$E_1 = \{(nw, (0, 1))\}$, $E_2 = \{(nw, (1, 0))\}$, with $E_2 \cap S_0 \neq \emptyset$,
i.e., the error is reachable.


C. Deadlock Detection in Disjunctive Systems 


The repair of concurrent systems is much harder than fixing 
monolithic systems. One of the sources of complexity is that 
a repair might introduce a deadlock, which is usually an 
unwanted behavior. In this section we show how we can detect 
deadlocks in disjunctive systems. 

Note that a set of deadlocked states is in general not
upward-closed under $\lesssim$ (defined in Sect. III-A): let $s = (q_A, c)$,
$r = (q_A, d)$ be global states with $s \lesssim r$. If $s$ is deadlocked, then
$c(i) = 0$ for every $q_i$ that appears in a guard of an outgoing
local transition from $s$. Now if $d(i) > 0$ for one of these $q_i$,
then some transition is enabled in $r$, which is therefore not
deadlocked.

A natural idea is to refine the wqo such that deadlocked
states are upward closed. To this end, consider
$\lesssim_0 \subseteq \mathbb{N}_0^{|B|} \times \mathbb{N}_0^{|B|}$ where

$c \lesssim_0 d \Leftrightarrow (c \lesssim d \wedge \forall i < |B| : (c(i) = 0 \Leftrightarrow d(i) = 0))$,

and $\lesssim_0 \subseteq S \times S$ where
$(q_A, c) \lesssim_0 (q'_A, d) \Leftrightarrow (q_A = q'_A \wedge c \lesssim_0 d)$.

Then, deadlocked states are upward closed with respect to
$\lesssim_0$. However, it is not easy to adapt the WSTS approach to this
case, since for our counter systems $pred(R)$ will in general not
be upward closed if $R$ is upward closed. Instead of using $\lesssim_0$
to define a WSTS, in the following we will use it to define
a counter abstraction (similar to the approach of Pnueli et
al. [18]) that can be used for deadlock detection.

The idea is that we use vectors with counter values from
$\{0, 1\}$ to represent their upward closure with respect to $\lesssim_0$.
These upward closures will be seen as abstract states, and in
the usual way we define that a transition between abstract states
$\hat{s}, \hat{s}'$ exists iff there exists a transition between concrete states
$s \in \uparrow\hat{s}$, $s' \in \uparrow\hat{s}'$. We formalize the abstract system in the
following, assuming wlog. that $\delta_B$ does not contain transitions
of the form $(q_i, \{q_i\}, q_j)$, i.e., transitions from $q_i$ that are
guarded by $q_i$.⁵


01-Counter System. For a given counter system $M$, we define
the 01-Counter System $\hat{M} = (\hat{S}, \hat{s}_0, \hat{\Delta})$, where:
• $\hat{S} \subseteq Q_A \times \{0, 1\}^{|B|}$ is the set of states,
• $\hat{s}_0 = (\mathit{init}_A, c)$ with $c(q) = 1$ iff $q = \mathit{init}_B$ is the initial
  state,
• $\hat{\Delta}$ is the set of transitions $((q_A, c), (q'_A, c'))$ s.t. one of
  the following holds:
  1) $c = c' \wedge \exists (q_A, g, q'_A) \in \delta_A : (q_A, c) \models_{q_A} g$
     (transition of $A$)
  2) $q_A = q'_A \wedge \exists (q_i, g, q_j) \in \delta_B : (q_A, c) \models_{q_i} g \wedge c(i) = 1 \wedge$
     $[(c(j) = 0 \wedge (c' = c - u_i + u_j \vee c' = c + u_j)) \vee$
     $(c(j) = 1 \wedge (c' = c - u_i \vee c' = c))]$
     (transition of a $B$-process)

Define runs and deadlocks of a 01-counter system similarly
as for counter systems. For a state $s = (q_A, c)$ of $M$, define
the corresponding abstract state of $\hat{M}$ as $\alpha(s) = (q_A, \hat{c})$ with
$\hat{c}(i) = 0$ if $c(i) = 0$, and $\hat{c}(i) = 1$ otherwise.

Theorem 1: The 01-counter system $\hat{M}$ has a deadlocked
run if and only if the counter system $M$ has a deadlocked run.

Proof idea: Suppose $x = s_1, s_2, \ldots, s_f$ is a deadlocked
run of $M$. Note that for any $s \in S$, a transition based on local
transition $t_U \in \delta_U$ is enabled if and only if a transition based
on $t_U$ is enabled in the abstract state $\alpha(s)$ of $\hat{M}$. Then it is
easy to see that $\hat{x} = \alpha(s_1), \alpha(s_2), \ldots, \alpha(s_f)$ is a deadlocked
run of $\hat{M}$.

Now, suppose $\hat{x} = \hat{s}_1, \hat{s}_2, \ldots, \hat{s}_f$ is a deadlocked run of $\hat{M}$.
Let $b$ be the number of transitions $(\hat{s}_k, \hat{s}_{k+1})$ based on some
$t_B = (q_i, g, q_j) \in \delta_B$ with $\hat{s}_{k+1}(i) = 1$, i.e., the transitions
where we keep a 1 in position $i$. Furthermore, let $t_1, \ldots, t_{f-1}$
be the sequence of local transitions that $\hat{x}$ is based on. Then
we can construct a deadlocked run of $M$ in the following way:
We start in $s_1 = (\mathit{init}_A, c_1)$ with $c_1(\mathit{init}_B) = 2^b$ ⁶ and for every
$t_k$ in the sequence do:
• if $t_k \in \delta_A$, we take the same transition once,
• if $t_k = (q_i, g, q_j) \in \delta_B$ with $\hat{s}_{k+1}(i) = 0$, we take the
  same local transition until position $i$ becomes empty, and
• if $t_k = (q_i, g, q_j) \in \delta_B$ with $\hat{s}_{k+1}(i) = 1$, we take the
  same local transition $c/2$ times, where $c$ is the number of
  processes that are in position $i$ before (i.e., we move half
  of the processes to $j$, and keep the other half in $i$).
By construction, after any of the transitions in $t_1, \ldots, t_{f-1}$,
the same positions as in $\hat{x}$ will be occupied in our constructed
run, thus the same transitions are enabled. Therefore, the
constructed run ends in a deadlocked state. ∎

Corollary 1: Deadlock detection in disjunctive systems is
decidable in EXPTIME (in $|Q_B|$).

⁵A system that does not satisfy this assumption can easily be transformed
into one that does, with a linear blowup in the number of states, and preserving
reachability properties including reachability of deadlocks.


An Algorithm for Deadlock Detection. Now we can modify
the model-checking algorithm to detect deadlocks in a
01-counter system $\hat{M}$: instead of passing a basis of the set
of errors in the parameter $ERR$, we pass a finite set of
deadlocked states $DEAD \subseteq \hat{S}$, and predecessors can directly
be computed by $pred$. Thus, an error sequence is of the form
$E_0, \ldots, E_k$, where $E_0 = DEAD$, $\forall 0 < i < k : E_i = pred(E_{i-1})$,
and $E_k = E_{k-1} \cap S_0$.
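
Since the 01-counter system is finite, deadlocks can also be found by plain explicit-state exploration. The following Python sketch uses a forward breadth-first search from the abstract initial state, which is a swapped-in technique chosen for brevity rather than the backward computation described above. All names and both example templates are hypothetical, and the sketch relies on the assumption that no $B$-transition is guarded by its own source state.

```python
from collections import deque

def enabled(qa, c, idx, src, g):
    """Abstract guard check; c holds 0/1 occupancy flags for the B-states."""
    def occ(q):
        return q in idx and c[idx[q]] == 1
    if src == qa:                                  # A moves
        return any(occ(q) for q in g)
    if not occ(src):                               # no B-process in src
        return False
    return qa in g or any(q != src and occ(q) for q in g)

def abstract_successors(qa, c, idx, delta_a, delta_b):
    out = set()
    for src, g, dst in delta_a:
        if src == qa and enabled(qa, c, idx, src, g):
            out.add((dst, c))
    for src, g, dst in delta_b:
        if not enabled(qa, c, idx, src, g):
            continue
        i, j = idx[src], idx[dst]
        stay = list(c)
        stay[j] = 1                                # position j becomes occupied
        leave = list(stay)
        if i != j:
            leave[i] = 0                           # the last process left i
        out.add((qa, tuple(stay)))
        out.add((qa, tuple(leave)))
    return out

def has_deadlock(init_a, init_b, idx, delta_a, delta_b):
    start = (init_a, tuple(1 if k == idx[init_b] else 0 for k in range(len(idx))))
    seen, queue = {start}, deque([start])
    while queue:
        qa, c = queue.popleft()
        succ = abstract_successors(qa, c, idx, delta_a, delta_b)
        if not succ:                               # no local transition enabled
            return True
        for s in succ - seen:
            seen.add(s)
            queue.append(s)
    return False

IDX = {"nr": 0, "r": 1}
ALL = frozenset({"nw", "w", "nr", "r"})
# Hypothetical templates with trivial writer guards: never deadlocks.
LIVE_A = [("nw", ALL, "w"), ("w", ALL, "nw")]
LIVE_B = [("nr", frozenset({"nw"}), "r"), ("r", ALL, "nr")]
# Hypothetical templates whose guards block every first step.
STUCK_A = [("nw", frozenset({"r"}), "w"), ("w", frozenset({"nr"}), "nw")]
STUCK_B = [("nr", frozenset({"w"}), "r"), ("r", frozenset({"nw"}), "nr")]

if __name__ == "__main__":
    print(has_deadlock("nw", "nr", IDX, LIVE_A, LIVE_B))
    print(has_deadlock("nw", "nr", IDX, STUCK_A, STUCK_B))
```

The state space has at most $|Q_A| \cdot 2^{|B|}$ elements, which is consistent with the exponential bound of Corollary 1.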


IV. PARAMETERIZED REPAIR ALGORITHM 


Now, we can introduce a parameterized repair algorithm 
that interleaves the backwards model checking algorithm 
(Algorithm 1) with a forward reachability analysis and the 
computation of candidate repairs. 


Forward Reachability Analysis. In the following, for a set
$R \subseteq S$, let $Succ(R) = \{s' \in S \mid \exists s \in R : s \to s'\}$.
Furthermore, for $s \in S$, let
$\Delta(s, R) = \{t_U \in \delta \mid t_U \in \Delta^{local}(s) \wedge \Delta(s, t_U) \in R\}$.

Given an error sequence $E_0, \ldots, E_k$, let the reachable error
sequence $RE = RE_0, \ldots, RE_k$ be defined by $RE_k = E_k$
(which by definition only contains initial states), and
$RE_{i-1} = Succ(RE_i) \cap \uparrow E_{i-1}$ for $1 \leq i \leq k$. That is, each $RE_i$
contains a set of states that can reach $\uparrow ERR$ in $i$ steps, and
are reachable from $S_0$ in $k - i$ steps. Thus, it represents a set
of concrete error paths of length $k$.
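
The definition of the reachable error sequence translates directly into a short loop. The Python sketch below is illustrative only: the function name, the list encoding of the error sequence, and the successor table used in the example are hypothetical, with the three layers taken from the reader-writer example above.

```python
def leq(s, r):
    return s[0] == r[0] and all(x <= y for x, y in zip(s[1], r[1]))

def reachable_error_sequence(E, succ):
    """E = [E_0, ..., E_k]; succ(s) yields the one-step successors of s.
    Returns [RE_0, ..., RE_k] with RE_k = E_k and
    RE_{i-1} = Succ(RE_i) intersected with the upward closure of E_{i-1}."""
    k = len(E) - 1
    RE = [set() for _ in range(k + 1)]
    RE[k] = set(E[k])
    for i in range(k, 0, -1):
        nxt = {t for s in RE[i] for t in succ(s)}
        RE[i - 1] = {t for t in nxt if any(leq(b, t) for b in E[i - 1])}
    return RE

# Hypothetical successor table for a few configurations.
SUCC = {
    ("nw", (1, 0)): [("w", (1, 0)), ("nw", (0, 1))],
    ("nw", (0, 1)): [("w", (0, 1))],
}

if __name__ == "__main__":
    E = [{("w", (0, 1))}, {("nw", (0, 1))}, {("nw", (1, 0))}]
    print(reachable_error_sequence(E, lambda s: SUCC.get(s, [])))
```

In the example, the successor $(w, (1, 0))$ of the initial state is discarded because it does not lie above any element of $E_1$, so only the concrete error path survives.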


Constraint Solving for Candidate Repairs. The generation
of candidate repairs is guided by constraints over the local
transitions $\delta$ as atomic propositions, such that a satisfying
assignment of the constraints corresponds to the candidate
repair, where only transitions that are assigned true remain
in $\delta'$. During an execution of the algorithm, these constraints
ensure that all error paths discovered so far will be avoided,
and include a set of fixed constraints that express additional
desired properties of the system, as explained in the following.

⁶Note that a similar, but more involved construction is also possible with
$c_1(\mathit{init}_B) = b$.


Initial Constraints. To avoid the construction of repairs that
violate the totality assumption on the transition relations of
the process templates, every repair for disjunctive systems has
to satisfy the following constraint:

$TRConstr_{Disj} = \bigwedge_{q_A \in Q_A} \bigvee_{t_A \in \delta_A(q_A)} t_A \;\wedge\; \bigwedge_{q_B \in Q_B} \bigvee_{t_B \in \delta_B(q_B)} t_B$

Informally, $TRConstr_{Disj}$ guarantees that a candidate repair
returned by the SAT solver never removes all local transitions
of a local state in $Q_A \cup Q_B$. Furthermore, a designer can add
constraints that are needed to obtain a repair that conforms
with their requirements, for example to ensure that certain
states remain reachable in the repair (see the extended
version [17] for more examples).
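
As a small illustration of this constraint, the Python sketch below builds one clause per local state and enumerates all assignments that satisfy it. The brute-force enumeration stands in for the SAT solver and is only feasible for tiny examples; the function names and the three-transition template are hypothetical.

```python
from itertools import product

def tr_constr(delta):
    """One clause per source state: at least one outgoing transition is kept.
    Assumes delta is total, so every local state occurs as a source."""
    by_source = {}
    for t in delta:
        by_source.setdefault(t[0], []).append(t)
    return [tuple(ts) for ts in by_source.values()]

def satisfies(clauses, kept):
    """`kept` is the set of transitions assigned true."""
    return all(any(t in kept for t in clause) for clause in clauses)

def candidates(delta, clauses):
    """Enumerate every subset of delta that satisfies all clauses."""
    for bits in product([True, False], repeat=len(delta)):
        kept = {t for t, b in zip(delta, bits) if b}
        if satisfies(clauses, kept):
            yield kept

if __name__ == "__main__":
    # Hypothetical template: two transitions from q0, one from q1.
    delta = [("q0", "g", "q1"), ("q0", "g", "q0"), ("q1", "g", "q0")]
    clauses = tr_constr(delta)
    for kept in candidates(delta, clauses):
        print(sorted(kept))
```

Additional designer constraints would simply be appended to the clause list before candidates are generated.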


A Parameterized Repair Algorithm. Given a counter system
$M$, a basis $ERR$ of the error states, and initial Boolean
constraints $initConstr$ on the transition relation (including
at least $TRConstr_{Disj}$), Algorithm 2 returns either a repair
$\delta'$ or the string Unrealizable to denote that no repair exists.


Properties of Algorithm 2. 
Theorem 2 (Soundness): For every repair δ' returned by 
Algorithm 2: 
• Restrict(M, δ') is safe, i.e., ↑ERR is not reachable, and 
• the transition relation of Restrict(M, δ') is total in the 
first two arguments. 

Proof: The parameterized model checker guarantees that 
the transition relation is safe, i.e., ↑ERR is not reachable. 
Moreover, the transition relation constraint TRConstr is part 
of initConstr and guarantees that, for any candidate repair 
returned by the SAT solver, the transition relation is total. ∎ 

Theorem 3 (Completeness): If Algorithm 2 returns “Unre- 
alizable”, then the parameterized system has no repair. 

Proof: Algorithm 2 returns “Unrealizable” if accCnstr ∧ 
initConstr has become unsatisfiable. We consider an arbitrary 
δ' ⊆ δ and show that it cannot be a repair. Note that 
for the given run of the algorithm, there is an iteration i 
of the loop such that δ', seen as an assignment of truth 
values to atomic propositions δ, was a satisfying assignment 
of accCnstr ∧ initConstr up to this point, and is not anymore 
after this iteration. 

If i = 0, i.e., δ' was never a satisfying assignment, then δ' 
does not satisfy initConstr and can clearly not be a repair. If 
i > 0, then δ' is a satisfying assignment for initConstr and all 
constraints added before round i, but not for the constraints 
⋀_{s ∈ RE_k} BuildConstr(s, [RE_{k−1}, ..., RE_0]) added in this 
iteration of the loop, based on a reachable error sequence 
RE = RE_k, ..., RE_0. By construction of BuildConstr, this 
means we can construct out of δ' and RE a concrete error 
path in Restrict(M, δ'), and δ' can also not be a repair. ∎ 


Algorithm 2 Parameterized Repair 

 1: procedure PARAMREPAIR(M, ERR, InitConstr) 
 2:   accCnstr ← InitConstr, isCorrect ← False 
 3:   while isCorrect = False do 
 4:     isCorrect, [E_0, ..., E_k] ← MC(M, ERR) 
 5:     if isCorrect = False then 
 6:       RE_k ← E_k  // E_k contains only initial states 
 7:       RE_{k−1} ← Succ(RE_k) ∩ ↑E_{k−1}, ..., RE_0 ← Succ(RE_1) ∩ ↑E_0 
 8:       // for every initial state in RE_k compute its constraints 
 9:       newConstr ← ⋀_{s ∈ RE_k} BuildConstr(s, [RE_{k−1}, ..., RE_0]) 
10:       // accumulate iterations' constraints 
11:       accCnstr ← newConstr ∧ accCnstr 
12:       // reset deadlock constraints 
13:       ddlockCnstr ← True 
14:       δ', isSAT ← SAT(accCnstr ∧ ddlockCnstr) 
15:       if isSAT = False then 
16:         return Unrealizable 
          // compute a new candidate using the repair δ' 
17:       M = Restrict(M, δ') 
18:       // if M reaches a deadlock get a new repair 
19:       if HasDeadlock(M) then 
20:         ddlockCnstr ← ¬δ' ∧ ddlockCnstr 
21:         jump to line 14 
22:     else return δ_M  // a repair is found! 

 1: procedure BUILDCONSTR(State s, RE) 
 2:   // s is a state, RE[1:] is a list obtained by removing 
      // the first element from RE 
 3:   if RE[1:] is empty then 
        // if t_U ∈ Δ^local(s) leads directly to the error set, delete it 
        // (¬t_U must be set to true by the SAT solver) 
 4:     return ⋀_{t_U ∈ Δ^local(s, RE[0])} ¬t_U 
 5:   else 
        // else either delete t_U or delete outgoing transitions of the 
        // target state of t_U recursively 
 6:     return ⋀_{t_U ∈ Δ^local(s, RE[0])} (¬t_U ∨ BuildConstr(Δ(s, t_U), RE[1:])) 
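
The recursion in BuildConstr can be illustrated with a small Python sketch over an explicit successor map. This is a simplification under stated assumptions: states are plain strings rather than counter-system configurations, the map `succ`, the formula encoding as nested tuples, and the helper `holds` are all hypothetical and not part of the authors' prototype.

```python
def build_constr(s, layers, succ):
    """layers = [RE_{k-1}, ..., RE_0] as sets of states; succ[s] lists
    (transition_name, successor) pairs. Returns a formula built from
    ("and", [...]), ("or", [...]) and ("not", transition_name)."""
    first, rest = layers[0], layers[1:]
    # Only steps that stay on the error sequence are constrained.
    steps = [(t, s2) for (t, s2) in succ.get(s, []) if s2 in first]
    if not rest:
        # The step leads directly into the error set: it must be deleted.
        return ("and", [("not", t) for t, _ in steps])
    # Otherwise delete the step, or cut every continuation behind it.
    return ("and", [("or", [("not", t), build_constr(s2, rest, succ)])
                    for t, s2 in steps])


def holds(formula, kept):
    """Evaluate a formula; kept is the set of transitions assigned true."""
    op, arg = formula
    if op == "not":
        return arg not in kept
    if op == "and":
        return all(holds(g, kept) for g in arg)
    return any(holds(g, kept) for g in arg)  # op == "or"


# Hypothetical error sequence: s0 --a--> s1 --c--> err.
succ = {"s0": [("a", "s1"), ("b", "s2")],
        "s1": [("c", "err")],
        "s2": [("d", "ok")]}
constraint = build_constr("s0", [{"s1"}, {"err"}], succ)
```

For this example the constraint is equivalent to ¬a ∨ ¬c: a candidate that keeps all transitions is rejected, while removing either `a` or `c` satisfies it.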


Theorem 4 (Termination): Algorithm 2 always terminates. 
Proof: For a counter system based on A and B, the 
number of possible repairs is bounded by 2^{|δ|}. In every 
iteration of the algorithm, either the algorithm terminates, 
or it adds constraints that exclude at least the repair that is 
currently under consideration. Therefore, the algorithm will 
always terminate. ∎ 


What can be done if a repair doesn’t exist? If Algorithm 
2 returns “unrealizable”, then there is no repair for the given 
input. To still obtain a repair, a designer can add more non- 
determinism and/or allow for more communication between 
processes, and then run the algorithm again on the new in- 
stance of the system. Moreover, unlike in monolithic systems, 
even if the result is “unrealizable”, it may still be possible to 
obtain a solution that is good enough in practice. For instance, 
we can change our algorithm slightly as follows: When the 
SAT solver returns “UNSAT” after adding the constraints for 
an error sequence, instead of terminating we can continue 
computing the error sequence until a fixed point is reached. 
Then, we can determine the minimal number of processes m_e 
that is needed for the last candidate repair to reach an error, 
and conclude that this candidate is safe for any system up to 
size m_e − 1. 


V. EXTENSIONS 
A. Beyond Reachability 


Algorithm 2 can also be used for repair with respect to 
general safety properties, based on the automata-theoretic 
approach to model checking. We assume that the reader is 
familiar with finite-state automata and with the automata-theoretic 
approach to model checking. 


Checking Safety Properties. Let M = (S, S_0, Δ) be a 
counter system of process templates A and B that violates 
a safety property φ over the states of A, and let 𝒜 = 
(Q^𝒜, q_0^𝒜, Q_A, δ^𝒜, F) be the automaton that accepts all words 
over Q_A that violate φ. To repair M, the composition M × 𝒜 
and the set of error states ERR = {((q_A, c), q^𝒜) | (q_A, c) ∈ 
S ∧ q^𝒜 ∈ F} can be given as inputs to the procedure 
ParamRepair. 

Corollary 1: Let ≼_𝒜 ⊆ (M × 𝒜) × (M × 𝒜) be a binary 
relation defined by: 

((q_A, c), q^𝒜) ≼_𝒜 ((q'_A, d), q'^𝒜) ⇔ c ≲ d ∧ q_A = q'_A ∧ q^𝒜 = q'^𝒜 

Then ((M × 𝒜), ≼_𝒜) is a WSTS with effective pred-basis. 

Similarly, the algorithm can be used for any safety property 
φ(A, B^(k)) over the states of A and of k B-processes. 
To this end, we consider the composition M × B^k × 𝒜 
with M = (S, S_0, Δ), B = (Q_B, init_B, G_B, δ_B), and 𝒜 = 
(Q^𝒜, q_0^𝒜, Q_A × Q_B^k, δ^𝒜, F), the automaton that reads states 
of A × B^(k) as actions and accepts all words that violate the 
property.⁷ 
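
As a rough illustration of the automata-theoretic reduction, the sketch below composes an explicit finite transition system with a deterministic violation automaton and searches the product for an accepting state. It deliberately ignores the parameterized setting (no counters, no WSTS reasoning); the function name, the state encoding, and the three-state automaton are assumptions made for this example only.

```python
from collections import deque


def product_reaches_error(init, succ, aut_init, aut_step, accepting):
    """Breadth-first search over pairs (system state, automaton state).
    aut_step(q, s) is the automaton state after reading system state s."""
    start = [(s, aut_step(aut_init, s)) for s in init]
    seen, queue = set(start), deque(start)
    while queue:
        s, q = queue.popleft()
        if q in accepting:
            return True
        for s2 in succ.get(s, []):
            nxt = (s2, aut_step(q, s2))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def aut_step(q, s):
    """Hypothetical automaton: after seeing 'w_nr', reading 'w_r' before any
    non-writing state is a violation (state 2 is accepting and absorbing)."""
    if q == 2:
        return 2
    if q == 0:
        return 1 if s == "w_nr" else 0
    if s == "w_r":
        return 2
    return 0 if s.startswith("nw") else 1


faulty = {"nw_nr": ["w_nr", "nw_r"], "w_nr": ["w_r", "nw_nr"],
          "w_r": ["nw_r"], "nw_r": ["nw_nr"]}
repaired = dict(faulty, **{"w_nr": ["nw_nr"]})  # drop the step w_nr -> w_r
```

With these definitions the faulty system reaches an accepting product state and the restricted one does not, which mirrors how ERR on the product is used as the error set for the repair procedure.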


Example. Consider again the simple reader-writer system in 
Figures 5 and 6, where we use the following abbreviations: 
(n)w for (non-)writing, and (n)r for (non-)reading. 
Assume that instead of the local transition (nr, {nw}, r) 
we have an unguarded transition (nr, Q, r). We want 
to repair the system with respect to the safety property 
φ = G[(w ∧ nr_1) → (nr_1 W nw)], where G and W are 
the temporal operators always and weak until, respectively. 
Figure 7 depicts the automaton equivalent to ¬φ. To 
repair the system we first need to split the guards as 
mentioned in Section II, i.e., (nr, Q, r) is split into 
(nr, {nr}, r), (nr, {r}, r), (nr, {nw}, r), and (nr, {w}, r). 
Then we consider the composition C = M × B × 𝒜 and we 
run Algorithm 2 on the parameters C, ((−, −, (∗, ∗)), q_2^𝒜) 
(where (−, −) means any writer state and any reader state, 
and ∗ means 0 or 1), and TRConstr_Disj. The model checker 


⁷ By symmetry, property φ(A, B^(k)) can be violated by these k explicitly 
modeled processes iff it can be violated by any combination of k processes 
in the system. 


Fig. 7: Automaton for ¬φ 


in Line 4 may return the following error sequences, where 
we only consider states that didn’t occur before: 

E_0 = {((−, −, (∗, ∗)), q_2^𝒜)}, 

E_1 = {((w, r_1, (0, 0)), q_1^𝒜)}, 

E_2 = {((w, nr_1, (0, 0)), q_0^𝒜), ((w, nr_1, (0, 1)), q_0^𝒜), 
((w, nr_1, (1, 0)), q_0^𝒜)}, 

E_3 = {((nw, nr_1, (0, 0)), q_0^𝒜), ((nw, nr_1, (0, 1)), q_0^𝒜), 
((w, r_1, (0, 0)), q_0^𝒜), ((w, r_1, (0, 1)), q_0^𝒜), ((w, r_1, (1, 0)), q_0^𝒜)} 


In Line 14 we find out that the error sequence 
can be avoided if we remove the transitions 
{(nr, {nr},r), (nr, {r},r), (nr, {w},r)}. Another call 


to the model checker in Line 4 finally assures that the new 
system is safe. Note that some states were omitted from error 
sequences in order to keep the presentation simple. 
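
The guard-splitting step of the example can be sketched in a few lines of Python. The triple encoding (source, guard, target) and the function name `split_guard` are illustrative assumptions; the set of guard states is the one used in the example.

```python
def split_guard(source, guard, target):
    """Replace one transition guarded by a set of states with one
    transition per single guard state."""
    return [(source, frozenset({g}), target) for g in sorted(guard)]


Q = {"w", "nw", "r", "nr"}            # all local states, i.e. "unguarded"
split = split_guard("nr", Q, "r")

# The repair from the example removes the variants guarded by nr, r and w.
removed = {frozenset({"nr"}), frozenset({"r"}), frozenset({"w"})}
kept = [t for t in split if t[1] not in removed]
```

After the removal only the variant guarded by `nw` remains, which corresponds to the original guarded transition (nr, {nw}, r).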


B. Beyond Disjunctive Systems 


Furthermore, we have extended Algorithm 2 to other sys- 
tems that can be framed as WSTS, in particular pairwise 
systems [2] and systems based on broadcasts or other global 
synchronizations [3], [19]. We summarize our results here; 
more details can be found in the extended version [17]. 

Both types of systems are known to be WSTS, and there 
are two remaining challenges: 


1) how to find suitable constraints to determine a restriction 
δ', and 
2) how to exclude deadlocks. 


The first is relatively easy, but the constraints become more 
complicated because we now have synchronous transitions 
of multiple processes. Deadlock detection is decidable for 
pairwise systems, but the best known method is by reduction 
to reachability in VASS [2], which has recently been shown to 
be TOWER-hard [20]. For broadcast protocols we can show 
that the situation is even worse: 

Theorem 5: Deadlock detection in broadcast protocols is 
undecidable. 

The main ingredient of the proof is the following lemma: 

Lemma 3: There is a polynomial-time reduction from the 
reachability problem of affine VASS with broadcast matrices 
to the deadlock detection problem in broadcast protocols. 

Proof: We modify the construction from the proofs of 
Theorems 3.17 and 3.18 from German and Sistla [2], using 
affine VASS instead of VASS and broadcast protocols instead 
of pairwise rendezvous systems. 

Starting from an arbitrary affine VASS G that only uses 
broadcast matrices and where we want to check if configuration 
(q_2, c_2) is reachable from (q_1, c_1), we first transform it 
to an affine VASS G* with the following properties: 
• each transition only changes the vector c in one of the 
following ways: (i) it adds to or subtracts from c a unit 
vector, or (ii) it multiplies c with a broadcast matrix M 
(this allows us to simulate every transition with a single 
transition in the broadcast system), and 

• some configuration (q'_2, 0) is reachable from some configuration 
(q'_1, 0) in G* if and only if (q_2, c_2) is reachable 
from (q_1, c_1) in G. 


The transformation is straightforward by splitting more complex 
transitions and adding auxiliary states. Now, based on 
G* we define process templates A and B such that A‖B^n 
can reach a deadlock iff (q'_2, 0) is reachable from (q'_1, 0) in 
G*. 

The states of A are the discrete states of G*, plus additional 
states q', q''. If the state vector of G* is m-dimensional, then B 
has states q_1, ..., q_m, plus states init, v. Then, corresponding 
to every transition in G* that changes the state from q to q' and 
either adds or subtracts unit vector u_i, we have a rendezvous 
sending transition from q to q' in A, and a corresponding 
receiving transition in B from init to q_i (if u_i was added), 
or from q_i to init (if u_i was subtracted). For every transition 
that changes the state from q to q' and multiplies c with a 
matrix M, A has a broadcast sending transition from q to q', 
and receiving transitions between the states q_1, ..., q_m that 
correspond to the effect of M. 

The additional states q', q'' of A are used to connect 
reachability of (q'_2, 0) to a deadlock in A‖B^n in the following 
way: (i) there are self-loops on all states of A except on q', 
i.e., the system can only deadlock if A is in q', (ii) there is a 
broadcast sending transition from q'_2 to q' in A, which sends 
all B-processes that are in q_1, ..., q_m to special state v, and 
(iii) from v there is a broadcast sending transition to init in 
B, and a corresponding receiving transition from q' to q'' in 
A. Thus, A‖B^n can only deadlock in a configuration where 
A is in q' and there are no B-processes in v, which is only 
reachable through a transition from a configuration where A 
is in q'_2 and no B-processes are in q_1, ..., q_m. Letting q'_1 be 
the initial state of A and init the initial state of B, such a 
configuration is reachable in A‖B^n if and only if (q'_2, 0) is 
reachable from (q'_1, 0) in G*. ∎ 


Approximate Methods for Deadlock Detection. Since solv- 
ing the problem exactly is impractical or impossible in general, 
we propose to use approximate methods. For pairwise systems, 
the 01-counter system introduced as a precise abstraction for 
disjunctive systems in Sect. III-C can also be used, but in this 
case it is not precise, i.e., it may produce spurious deadlocked 
runs. Another possible overapproximation is a system that sim- 
ulates pairwise transitions by a pair of disjunctive transitions. 
For broadcast protocols we can use lossy broadcast systems, 
for which the problem is decidable [21].⁸ Another alternative 
is to add initial constraints that restrict the repair algorithm 
and imply deadlock-freedom. 


⁸ Note that in the terminology of Delzanno et al., deadlock detection is a 
special case of the TARGET problem. 


VI. IMPLEMENTATION & EVALUATION 


We have implemented a prototype of our parameterized 
repair algorithm that supports the three types of systems (dis- 
junctive, pairwise and broadcast), and safety and reachability 
properties. For disjunctive and pairwise systems, we have 
evaluated it on different variants of reader-writer protocols, 
based on the ones given in Sects. I and II, where we replicated 
some of the states and transitions to test the performance of 
our algorithm on bigger benchmarks. For disjunctive systems, 
all variants have been repaired successfully in less than 2s. For 
pairwise systems, these benchmarks are denoted “RWi (PR)” 
in Table I. A detailed treatment of one benchmark, including 
an explanation of the whole repair process, is given in the 
extended version [17]. 

For broadcast protocols, we have evaluated our algorithm 
on a range of more complex benchmarks taken from the 
parameterized verification literature [22]: a distributed Lock 
Service (DLS) inspired by the Chubby protocol [23], a dis- 
tributed Robot Flocking protocol (RF) [24], a distributed 
Smoke Detector (SD) [19], a sensor network implementing 
a Two-Object Tracker (2OT) [25], and the cache coherence 
protocol MESI [26], in different variants constructed similarly 
to those for RW. 

Typical desired safety properties are mutual exclusion and 
similar properties. Since deadlock detection is undecidable for 
broadcast protocols, the absence of deadlocks needs to be 
ensured with additional initial constraints. 

On all benchmarks, we compare the performance of our 
algorithm based on the valuations of two flags: SEP and EPT. 
The SEP (“single error path”) flag indicates that, instead of 
encoding all the model checker’s computed error paths, only 
one path is picked and encoded for SAT solving. When the 
EPT (“error path transitions”) flag is raised the SAT formula is 
constructed so that only transitions on the extracted error paths 
may be suggested for removal. Note that in the default case, 
even transitions that are unrelated to the error may be removed. 
Table I summarizes the experimental results we obtained. 

We note that the algorithm deletes fewer transitions when 
the EPT flag is raised (EPT=T). This is because we tell the 
SAT solver explicitly not to delete transitions that are not on 
the error paths. Removing fewer transitions might be desirable 
in some applications. We observe the best performance when 
the SEP flag is set to true (SEP=T) and the EPT flag is 
false. This is because the constructed SAT formulas are much 
simpler and the SAT solver has more freedom in deleting 
transitions, resulting in a small number of iterations. 


VII. RELATED WORK 


Many automatic repair approaches have been considered 
in the literature, most of them restricted to monolithic sys- 
tems [11], [12], [27]-[30]. Additionally, there are several 
approaches for synchronization synthesis and repair of con- 
current systems. Some of them differ from ours in the un- 
derlying approach, e.g., being based on automata-theoretic 
synthesis [31], [32]. Others are based on a similar underlying 
counterexample-guided synthesis/repair principle, but differ in 
TABLE I: Running time, number of iterations, and number of deleted transitions (#D.T.) for the different configurations. Each 
benchmark is listed with its number of local states and edges. We evaluated the algorithms on different sets of errors with 
P1 ∪ P2 = C, where P1 and P2 are two distinct error sets that differ from one benchmark to another. Smallest number of 
iterations, runtime per benchmark, deleted transitions are highlighted in boldface. 

                 Size                |    [SEP=F & EPT=F]    |    [SEP=T & EPT=F]    |    [SEP=F & EPT=T]    |    [SEP=T & EPT=T] 
Benchmark  States  Edges  Errors     | #Iter  Time    #D.T.  | #Iter  Time    #D.T.  | #Iter  Time    #D.T.  | #Iter  Time    #D.T. 
RW1 (PW)   5       12     C          | 3      25      4      | 3      2.9     4      | 2      1.7     2      | 2      1.7     2 
RW2 (PW)   15      42     C          | 3      3.8     14     | 3      4.8     14     | 2      3.2     7      | 7      8.4     7 
RW3 (PW)   35      102    C          | 3      820.7   34     | 3      7.6     34     | 2      552.3   17     | 17     40.3    17 
RW4 (PW)   45      132    C          | TO     TO      TO     | 3      11.8    44     | TO     TO      TO     | 22     99.2    22 
DLS        10      95     P1         | 1      0.8     13     | 1      0.8     13     | 3      2.4     5      | 5      5.6     5 
DLS        10      95     P2         | 1      0.8     13     | 2      1.7     13     | 3      2.6     9      | 7      5.5     9 
DLS        10      95     C          | 2      4.2     13     | 2      1.5     13     | 3      3       9      | 9      8.1     9 
RF         10      147    P1         | 1      2.5     32     | 1      1.2     32     | TO     TO      TO     | 8      12.4    13 
RF         10      147    P2         | 1      1.2     32     | 1      1.3     32     | TO     TO      TO     | 8      11.3    14 
RF         10      147    C          | 1      7.8     32     | 1      1.4     32     | TO     TO      TO     | 8      12.5    12 
SD         6       39     C          | 1      1       4      | 1      1       4      | 3      2.4     4      | 3      3       4 
2OT        12      128    P1         | 12     18.8    26     | 6      8.3     26     | 16     73.8    17     | 16     34      17 
2OT        12      128    P2         | 1      1.8     26     | 1      1.8     26     | 4      2958    11     | 8      16.5    12 
2OT        12      128    C          | 11     17.2    Unreal.| 6      11.7    Unreal.| TO     TO      TO     | 11     48.6    Unreal. 
MESI1      4       26     C          | 1      2.4     6      | 1      0.9     6      | 2      1.8     5      | 4      3.5     5 
MESI2      9       71     C          | 1      1.1     26     | 1      1.1     26     | 3      56.4    20     | 6      6.8     15 
MESI3      14      116    C          | 1      109.4   46     | 1      108.1   46     | TO     TO      TO     | 6      289.9   15 


other aspects from ours. For instance, there are approaches that 
repair the program by adding atomic sections, which forbid 
the interruption of a sequence of program statements by other 
processes [13], [33]. Assume-Guarantee-Repair [34] combines 
verification and repair, and uses a learning-based algorithm to 
find counterexamples and restrict transition guards to avoid 
errors. In contrast to ours, this algorithm is not guaranteed 
to terminate. From lazy synthesis [35] we borrow the idea to 
construct the set of all error paths of a given length instead of 
a single concrete error path, but this approach only supports 
systems with a fixed number of components. Some of these 
existing approaches are more general than ours in that they 
support certain infinite-state processes [13], [33], [34], or 
more expressive specifications and other features like partial 
information [31], [32]. 

The most important difference between our approach and 
all of the existing repair approaches is that, to the best of 
our knowledge, none of them provide correctness guarantees 
for systems with a parametric number of components. This 
includes also the approach of McClurg et al. [14] for the 
synthesis of synchronizations in a software-defined network. 
Although they use a variant of Petri nets as a system model, 
which would be suitable to express parameterized systems, 
their restrictions are such that the approach is restricted to 
a fixed number of components. In contrast, we include a 
parameterized model checker in our repair algorithm, and can 
therefore provide parameterized correctness guarantees. There 
exists a wealth of results on parameterized model checking, 
collected in several good surveys recently [36]-[38]. 


VIII. CONCLUSION AND FUTURE WORK 


We have investigated the parameterized repair problem for 
systems of the form A‖B^n with an arbitrary n ∈ ℕ. We introduced 
a general parameterized repair algorithm, based on interleaving 
the generation of candidate repairs with parameterized 


model checking and deadlock detection, and instantiated this 
approach to different classes of systems that can be modeled 
as WSTS: disjunctive systems, pairwise rendezvous systems, 
and broadcast protocols. 

Since deadlock detection is an important part of our method, 
we investigated this problem in detail for these classes of 
systems, and found that the problem can be decided in 
EXPTIME for disjunctive systems, and is undecidable for 
broadcast protocols. 

Besides reachability properties and the absence of dead- 
locks, our algorithm can guarantee general safety properties, 
based on the automata-theoretic approach to model checking. 
On a prototype implementation of our algorithm, we have 
shown that it can effectively repair non-deterministic overap- 
proximations of many examples from the literature. Moreover, 
we have evaluated the impact of different heuristics or design 
choices on the performance of our algorithm in terms of repair 
time, number of iterations, and number of deleted transitions. 

A limitation of the current algorithm is that it cannot 
guarantee any liveness properties, like termination or the 
absence of undesired loops. Also, it cannot automatically add 
behavior (states, transitions, or synchronization options) to the 
system, in case the repair for the given input is unrealizable. 
We consider these as important avenues for future work. 
Moreover, in order to improve the practicality of our approach 
we want to examine the inclusion of symbolic techniques for 
counter abstraction [39], and advanced parameterized model 
checking techniques, e.g., cutoff results for disjunctive sys- 
tems [6], [40], [41], or recent pruning results for immediate 
observation Petri nets, which model exactly the class of 
disjunctive systems [42]. 
Abstract—Scalable protocols and web services are typically 
parameterized: that is, each instance of the system is formed by 
linking together isomorphic copies of a representative process. 
Verification of such systems is difficult due to state explosion 
for large instances and the undecidability of verifying properties 
over all instances at once. This work turns instead to the 
derivation of a parameterized protocol from its specification. We 
exploit a reduction theorem showing that it suffices to construct 
a representative process P that meets a local specification 
under interference by neighboring copies of P. Every instance 
of the parameterized protocol is built by deploying replicated 
instances of P. While the reduction from the original to a local 
specification is done by hand, the construction of P is fully 
automated. This is a new and challenging synthesis question, as 
one must synthesize an unknown process P while simultaneously 
considering interference by copies of this unknown process. We 
present two algorithms: an eager reduction to the synthesis 
of a transformed specification, and a lazy, iterative, tableau 
construction which incorporates fresh interference at each step. 
The tableau method has worst-case complexity that is exponential 
in the length of the local specification. We have implemented the 
tableau construction and show that it is capable of synthesizing 
parameterized protocols for mutual exclusion, leader election, 
and dining philosophers. 


I. INTRODUCTION 


Scalable systems, such as network communication proto- 
cols, distributed algorithms, and multi-core hardware models, 
are typically parameterized — that is, they are composed of 
many isomorphic copies of a representative process. These 
processes interact with each other according to an underlying 
communication scheme. Automated verification of such sys- 
tems quickly runs into state explosion with increasing instance 
size, as an instance with K processes can have a reachable 
state space that is exponential in K. The alternative of “once 
and for all” verification of all instances at once is undecidable 
in general [1]. 

In this work, we turn instead to the construction (synthesis) 
of a parameterized system from its specification. The key to 
the presented methodology is a compositional (i.e. assume- 
guarantee) reduction theorem from [2] which exploits the 
symmetry inherent in these systems, showing that it suffices 
to verify that a localized property holds of a representative 
process P under interference from neighboring copies of P. 
The first step of the methodology is to reduce the global 
specification of the desired parameterized system to a localized 
property. This reduction varies by application, as the global 
specification is itself parameterized and quantified (e.g., “all 
instances satisfy mutual exclusion”) while the local specifi- 
cation is quantifier-free. The second step is to synthesize an 
appropriate process P from the local specification, which is 
carried out automatically. 

This synthesis question is of a new and challenging type. 
The standard formulation of temporal synthesis is to construct 
a process P satisfying a given temporal specification φ. However, 
our reduction requires the construction of a process P 
whose closure under interference by copies of (the unknown) 
P satisfies a temporal specification φ. That is, the synthesis 
procedure must somehow derive a suitable process while 
simultaneously taking into account the effects of interference 
by adjacent copies of this unknown process. Every instance of 
the protocol is built by deploying replicated instances of the 
synthesized P. 

We provide two algorithms for the synthesis question. The 
first is an ‘eager’ method that transforms a given specification 
φ to a new specification Z(φ) which incorporates self-interference; 
one can then apply standard synthesis methods 
to Z(φ). The second is a ‘lazy’ method which iteratively 
constructs a sequence of tableaux starting with a tableau for 
φ; at each iteration, the current tableau is extended with 
interference transitions. The limit tableau is then pruned to 
obtain the solution. Although the eager method is direct, the 
transformation from φ to Z(φ) always incurs an exponential 
blowup in the number of proposition symbols in φ. For this 
reason, we implement the lazy method and show that it can 
synthesize solutions for mutual exclusion, leader election, and 
dining philosophers specifications. 

This approach does not provide a complete solution to the 
parameterized synthesis question, for several reasons. The first 
is that the reduction from a quantified global specification to 
an unquantified local specification is carried out by hand. The 
second is that the process P to be derived can only have a 
fixed-size neighborhood, as otherwise one would require an 
unbounded quantification over the neighbors of P. Hence, 
the method can derive solutions for rings, tori, wrap-around 
mesh, and other networks where the degree of a node is 
independent of the number of nodes in an instance. (The 
use of localized abstractions, e.g., [3], may help bypass 
this limitation; we plan to investigate this in future work.) 
Finally, both algorithms produce a process P where any two 
Fig. 1. The tile of the dining philosophers protocol. 


states that satisfy the same propositions have identical future 
behavior. This rules out the synthesis of auxiliary state beyond 
that defined by propositional valuations. Nonetheless, despite 
these limitations, one can synthesize correct-by-construction 
parameterized protocols for the specifications listed above. 

In the sequel, for ease of exposition, we limit attention 
to parameterized protocols on ring networks. Local sym- 
metry ensures that a single representative process suffices. 
This process has two neighbors, one to the left and one to 
the right. Specifications are expressed in CTL, augmented 
with unconditional fairness on process schedules. The tableau 
method builds on the classical tableau constructions for CTL 
and Fair CTL and their associated synthesis procedures based 
on pruning states and transitions from the tableau. 

It is worth noting that while the construction of a single 
instance with a fixed number of processes is a closed synthesis 
question, the derivation of a representative for all instances is 
an open synthesis question. 


II. PRELIMINARIES 
A. Rings: Structure, Semantics, and Interference 


A ring of size K is a directed graph with node set N = 
[0..K), and edge set {E_i} for i ∈ [0..K). Node i is connected 
to edges E_i (on its left) and E_(i+1) (on its right). (Arithmetic 
is implicitly modulo K.) Edge E_i is connected to nodes (i − 1) 
(on its left) and i (on its right). Two nodes are neighbors if 
they have a common connected edge. The set of neighbors of 
node i is denoted nbr(i). 
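
The ring structure can be written down directly. The following Python sketch is only an illustration of the definitions above; the function names `ring` and `nbr` and the dictionary layout are assumptions, with edges identified by their index.

```python
def ring(K):
    """Return (edges_of, nodes_of) for a ring of size K.
    edges_of[i] = (left edge, right edge) of node i, i.e. (E_i, E_{i+1});
    nodes_of[i] = (left node, right node) of edge E_i, i.e. (i-1, i).
    All arithmetic is modulo K."""
    edges_of = {i: (i, (i + 1) % K) for i in range(K)}
    nodes_of = {i: ((i - 1) % K, i) for i in range(K)}
    return edges_of, nodes_of


def nbr(i, K):
    """Neighbors of node i: the other nodes sharing a connected edge."""
    edges_of, _ = ring(K)
    return {j for j in range(K)
            if j != i and set(edges_of[j]) & set(edges_of[i])}
```

For K ≥ 3 every node has exactly two neighbors, independent of K, which is the fixed-size neighborhood property the method relies on.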

The parameterized networks of interest are uniform rings of 
arbitrary size, in that the process at each node is a copy of 
a single ‘tile’ process (cf. [2]). Figure 1 shows a tile and the 
construction of an instance through replication. The external 
variables of a process are those assigned to adjacent edges. 
(In the figure, an incoming arrow represents read access; an 
outgoing arrow represents write access.) A process may also 
have internal state variables assigned to the node. 

For readability, we denote the representative process by P_n, 
so we can speak of its neighbors as P_{n−1} and P_{n+1} (and 
either of them as P_m). It is important that P_n is not viewed 
as the n’th process in a particular instance but rather as the 
representative process for all instances. 

The external and internal variables of P_n together form the 
state space of P_n, which is the collection of valuations to these 
variables. The state machine for P_n is a tuple (S_n, S_n^0, T_n, λ_n) 
where S_n is the state space; S_n^0 is a non-empty set of initial 
states; T_n ⊆ S_n × S_n is a transition relation; and λ_n : S_n → 
2^{Σ_n} is a function that labels each state with a subset of atomic 
propositions from the set Σ_n. 

As defined, the state machine of P_n is a labeled state 
transition system that describes the behavior of the representative 
process alone in its neighborhood. A neighbor can 
interfere with P_n by changing the values of commonly shared 
(necessarily external) variables. A joint state is a pair of states 
(s, t), with s from P_n and t from a neighbor P_m, that agree on 
the valuation to their shared variables. A joint transition from 
joint state (s, t) to joint state (s', t') by process P_m is defined 
if (t, t') is in T_m and the values of variables of P_n that are 
not shared with P_m are equal in s and s'. We say that (s, s') 
is an interference transition caused by P_m. For example, ‘P_m 
passes a token to P_n’ is an interference transition. 

We denote the i-th copy of the representative process P_n 
in an instance by P_i. The K-process instance formed by 
copies P_0 ... P_{K−1} has the global state transition relation 
G = (S, S^0, T, λ). Here each state s ∈ S is a valuation to the 
internal variables of each process, together with a valuation to 
the external edge variables; S^0 is a non-empty set of initial 
states, where each state in S^0 projects to an initial state of P_i, 
for all i. The transition relation T defines non-deterministic 
interleaving: (s, i, s') is in T if (s[i], s'[i]) is in T_i and the 
value of any variables not in process P_i is the same in s and 
s'. Here, the notation s[i] represents the projection of s on 
the variables of P_i. The labeling λ of a state s is the indexed 
union of all local labelings λ_i(s[i]). 

From G one can define a machine G_i by projecting out 
the labels of transitions other than those of the i’th process. 
I.e., consider a transition (s, k, s') of G. If k = i, retain the 
transition as is; otherwise, replace the label with τ. 

The effect of interference on P_n is given by a transition 
system H_n^θ defined in [2]; we repeat the definition here. A 
compositional inductive invariant θ of an instance is a set of 
local assertions {θ_n} with the following properties: for every 
n, (1) θ_n includes the initial states of P_n; (2) transitions by P_n 
preserve θ_n; and (3) interference transitions by P_m from joint 
states satisfying θ_n and θ_m preserve θ_n. These properties can 
be converted to simultaneous pre-fixpoint form over {θ_n}. By 
the Knaster-Tarski theorem, the least fixpoint is the strongest 
compositional invariant, denoted by θ*. 

States of H_n^θ are the local states S_n that satisfy θ_n; 
transitions of H_n^θ are of two types: (1) a transition by P_n, 
denoted (s, n, s'), where θ_n(s) holds and (s, s') is in T_n, and 
(2) an interference transition denoted (s, m, s') representing 
a transition by P_m from a joint state (s, t) where θ_n(s) and 
θ_m(t) hold, to a joint state (s', t'). 

This transition system is linked to the global transition 
system with respect to local properties. 


Theorem II.1. ([2]) H_i^θ stuttering-simulates G_i for every i. 
Moreover, if H_i^θ satisfies an ‘outward-facing’ restriction, then 
H_i^θ and G_i are stuttering-bisimilar. 


The systems are equivalent only up to stuttering as H_i^θ 
does not take into account transitions by processes ‘far away’ 
from position i, while G of course contains all transitions. 
The outward-facing restriction says (informally) that the interference 
by a neighboring process m depends only on the 
valuation of the variables shared by P_m and P_n. 

The transition system $H_n^{\theta^*}$ induced by the strongest
compositional invariant $\theta^*$ is of special interest; we abbreviate it as
$H_n^*$. It is constructed by an inductive, least fixpoint process.
(1) The initial structure $H_n$ consists of the initial states of $P_n$.
Apply steps (2) or (3) in any fairly interleaved order until no
new transitions can be added; the result is $H_n^*$. Step (2) applies
an enabled transition of $P_n$ to a reachable state of $H_n$, labeling
it by $n$. Step (3) views the currently reachable $n$-transitions
of $H_n$ as transitions from its (isomorphic) neighboring copy
$H_m$ and adds an enabled interference transition to a reachable
state of $H_n$, labeling it by $m$.
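
The three construction steps can be read as a worklist-style least fixpoint. The following is a minimal sketch under strong simplifying assumptions, not the construction of [2] itself: the names `interference_closure`, `step_n`, and `as_interference` are hypothetical, a local state is a pair `(x, pc)` with a single shared bit `x`, and a neighbor's move is considered enabled whenever it changes `x` starting from the current value of `x`.

```python
from typing import Callable, Hashable, Iterable, Optional, Set, Tuple

State = Hashable
Edge = Tuple[State, str, State]  # (source, label, target); label is "n" or "m"


def interference_closure(
    init: Iterable[State],
    step_n: Callable[[State], Iterable[State]],
    as_interference: Callable[[State, State, State], Optional[State]],
) -> Tuple[Set[State], Set[Edge]]:
    """Least fixpoint of steps (2) and (3), starting from the initial states."""
    states: Set[State] = set(init)  # step (1)
    edges: Set[Edge] = set()
    changed = True
    while changed:
        changed = False
        # Step (2): fire enabled transitions of P_n from reachable states.
        for s in list(states):
            for s2 in step_n(s):
                if (s, "n", s2) not in edges:
                    edges.add((s, "n", s2))
                    states.add(s2)
                    changed = True
        # Step (3): replay reachable n-transitions as moves of the neighbor.
        n_moves = [(u, v) for (u, label, v) in edges if label == "n"]
        for u, v in n_moves:
            for s in list(states):
                s2 = as_interference(s, u, v)
                if s2 is not None and (s, "m", s2) not in edges:
                    edges.add((s, "m", s2))
                    states.add(s2)
                    changed = True
    return states, edges


# Hypothetical toy process: a state is (x, pc); x is the single shared bit.
def step_n(state):
    x, pc = state
    if pc == "idle" and x == 0:
        return [(1, "busy")]
    if pc == "busy":
        return [(0, "idle")]
    return []


def as_interference(state, u, v):
    x, pc = state
    # Assumed enabling condition: the move changes x and starts from our x.
    if u[0] != v[0] and x == u[0]:
        return (v[0], pc)  # the internal state pc is left untouched
    return None


states, edges = interference_closure([(0, "idle")], step_n, as_interference)
```

The loop stops only when neither step adds a transition, which mirrors the "until no new transitions can be added" condition; on a finite state space this always terminates.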


B. Local Fair CTL 


Let the scheduling of the process network be unconditionally
fair. We use fair computation tree logic (Fair CTL) [4] to
represent a local correctness property $\varphi_n$, e.g., ‘$P_n$ accesses
the shared resource if $P_n$ owns the token’. The induced
parametric global correctness property is the conjunction $\bigwedge_i \varphi_i$.

Syntax. The language of Fair CTL contains atomic propositions $\Sigma_n$, Boolean
operators $\neg$, $\wedge$, $\vee$, $\Rightarrow$, $\Leftrightarrow$, linear time temporal operators $X_n$
(process indexed strong next-time), $Y_n$ (process indexed weak
next-time), G (always), F (sometime), U (until), W (dual of
until), and path quantifiers A (for all paths), E (there exists a
path).

We have the following syntax for Fair CTL. If $p \in \Sigma_n$, then
$p$ is a formula. If $f, g$ are formulae, then so are $\neg f$, $f \wedge g$,
$f \vee g$, $f \Rightarrow g$, $f \Leftrightarrow g$, $AY_n f$, $EX_n f$, $AGf$, $EGf$, $AFf$, $EFf$,
$A\,fUg$, $E\,fUg$, $A\,fWg$, and $E\,fWg$.

As given in [5], we use indexed next-time operators $X_n$
and $Y_n$ in place of the unindexed ones, where $X_n f$ means
that the immediate successor state $s'$ (along any maximal
path designated by a path quantifier) is reached by executing
one step of $P_n$, and $f$ is true in $s'$; and $Y_n f$ means that
if the immediate successor state $s'$ (along any maximal path
designated by a path quantifier) is reached by executing one
step of $P_n$, then $f$ is true in $s'$.
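
The difference between the strong and weak indexed next-time operators is easiest to see on a small finite structure. The sketch below is an illustration only: it ignores the fairness assumption entirely, and the helper names `ex` and `ay` and the edge encoding `(source, process, target)` are assumptions of this example.

```python
def ex(edges, proc, sat):
    """States with at least one proc-labeled successor in sat (strong next)."""
    return {s for (s, label, t) in edges if label == proc and t in sat}


def ay(states, edges, proc, sat):
    """States all of whose proc-labeled successors are in sat (weak next)."""
    return {
        s
        for s in states
        if all(t in sat for (u, label, t) in edges if u == s and label == proc)
    }


states = {0, 1, 2}
edges = {(0, "n", 1), (0, "m", 2), (1, "n", 1)}
sat_f = {1}  # the states where f holds

strong_n = ex(edges, "n", sat_f)
weak_n = ay(states, edges, "n", sat_f)
weak_m = ay(states, edges, "m", sat_f)
```

State 2 has no `n`-successor, so it satisfies the weak form vacuously but not the strong form. The two operators are dual: `ay(states, edges, p, sat)` equals `states - ex(edges, p, states - sat)`.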

Globally, unconditionally fair scheduling asserts that all
processes are selected for execution infinitely often by the
scheduler. Locally, the fairness assumption is expressed as
‘$P_n$ and its neighbors are executed infinitely often’. The path
quantifiers A and E in Fair CTL are subscripted by the fixed
local fairness assumption, $\Phi$, indicating that quantifications are
performed only on fair paths.

In Fair CTL, a path quantifier is followed by a linear-time
temporal operator. The pairs are the basic modalities.
A formula whose basic modality is $A_\Phi U$, $E_\Phi U$, $A_\Phi F$, $E_\Phi F$,
or $E_\Phi G$ is an eventuality formula corresponding to a liveness
property. Formulae $A_\Phi G$ are invariants corresponding to safety
properties. In addition, we assume all formulae are converted
into positive normal form, which means the negations are
driven inwards to atomic propositions.


Semantics. A local Fair CTL formula $\varphi_n$ is interpreted
on the local state transition system $H_n^*$ and the global state
transition system $G_n$. Let $M = (S, S^0, T, \lambda)$ be a structure.
A path, $\pi = (s_0, s_1, \ldots)$, is a sequence of states such that
$(s_i, s_{i+1}) \in T$ for all $i$, and $\pi^j = (\pi_j, \pi_{j+1}, \ldots)$ is the suffix
of $\pi$ starting at state $\pi_j = s_j$. A full path is an infinite path, and
self-loops are allowed. A full path is fair iff it satisfies $\Phi$.

We use $M, s \models_\Phi f$ to mean that the formula $f$ is true in
$M$ at state $s$ under the fairness assumption $\Phi$. We define $\models_\Phi$
inductively as follows:


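• $M, s \models_\Phi p$ iff $p \in \lambda(s)$ for atomic proposition $p$.

• $M, s \models_\Phi \neg f$ iff not ($M, s \models_\Phi f$).

• $M, s \models_\Phi f \wedge g$ iff $M, s \models_\Phi f$ and $M, s \models_\Phi g$.

• $M, s_0 \models_\Phi E_\Phi X_n f$ iff there exists $\pi = (s_0, s_1, \ldots)$, such
  that $(s_0, s_1) \in T_n$, $M, \pi \models \Phi$, and $M, s_1 \models_\Phi f$.

• $M, s_0 \models_\Phi A_\Phi Y_n f$ iff for all $\pi = (s_0, s_1, \ldots)$, if $(s_0, s_1) \in T_n$
  and $M, \pi \models \Phi$, then $M, s_1 \models_\Phi f$.

• $M, s_0 \models_\Phi E_\Phi(fUg)$ iff there exists $\pi = (s_0, s_1, \ldots)$,
  such that $M, \pi \models \Phi$, and there exists $i \geq 0$, such that
  $M, s_i \models_\Phi g$, and for all $0 \leq j < i$, $M, s_j \models_\Phi f$.

• $M, s_0 \models_\Phi A_\Phi(fUg)$ iff for all $\pi = (s_0, s_1, \ldots)$, if $M, \pi \models \Phi$,
  then there exists $i \geq 0$, such that $M, s_i \models_\Phi g$, and for
  all $0 \leq j < i$, $M, s_j \models_\Phi f$.

By abbreviations, $f \vee g \equiv \neg(\neg f \wedge \neg g)$, $A(fWg) \equiv \neg E(\neg f U \neg g)$,
$E(fWg) \equiv \neg A(\neg f U \neg g)$, $AGf \equiv \neg EF\neg f$, and $EGf \equiv \neg AF\neg f$
(hence, $AFf \equiv A(\mathit{true}\,Uf)$, $EFf \equiv E(\mathit{true}\,Uf)$,
$AGf \equiv A(\mathit{false}\,Wf)$, and $EGf \equiv E(\mathit{false}\,Wf)$). A formula $f$
is satisfiable iff there exists a model $M$ such that $M, s \models_\Phi f$
for some state $s$ of $M$.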
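
As a short check of the abbreviations, the identity $AGf \equiv A(\mathit{false}\,Wf)$ can be unfolded from the definition of $W$ as the dual of $U$. The derivation below omits fairness subscripts and uses only the abbreviations listed in this subsection.

```latex
\begin{align*}
A(\mathit{false}\,W f)
  &\equiv \neg E(\neg\mathit{false}\;U\;\neg f)
     && \text{definition of } W \\
  &\equiv \neg E(\mathit{true}\;U\;\neg f) \\
  &\equiv \neg EF\,\neg f
     && \text{since } EFg \equiv E(\mathit{true}\,U g) \\
  &\equiv AGf .
\end{align*}
```

The case $EGf \equiv E(\mathit{false}\,Wf)$ is symmetric, exchanging the roles of the two path quantifiers.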


C. Fairness and Outward-Facing 


The local fairness assumption is $\Phi = F^\infty ex_n \wedge \bigwedge_m F^\infty ex_m$,
with $m$ ranging over the neighbors of $n$.
The path formula $F^\infty ex_n$ asserts that $P_n$ is selected for
execution infinitely often by the scheduler. The infinitary
linear time operator $F^\infty$ abbreviates GF and is interpreted as
$M, \pi \models F^\infty g$ iff for every $i \geq 0$, there exists $j \geq i$, such that
$M, \pi^j \models g$.

Formally, outward-facing is defined relative to $\Phi$, extending
the definition in [2]. Let $s$ and $t$ be two states on $H_n^*$; $s$ and $t$
are related by a relation $B_{n,m}$ if $s[e] = t[e]$ for every common
connected edge $e$ between $n$ and $m$. The notation $s[e]$ denotes
the value of the external variable assigned to $e$ at $s$. Process
$P_n$ is outward-facing in its interactions with $P_m$ if $B_{n,m}$ is a
stuttering bisimulation on $H_n^*$.


D. Parameterized Synthesis 


We can now explain precisely how the reduction theorem 
supports parameterized synthesis. 


Theorem II.2. Let $\varphi_n$ be a local Fair CTL specification.
Let $P_n$ be a process such that its derived $H_n^*$ satisfies $\varphi_n$.
Every instance of the parameterized system constructed from
isomorphic copies of $P_n$ satisfies the global property $\bigwedge_i \varphi_i$.


Proof. Consider $P_n$ and its induced $H_n^*$ which satisfies the
local correctness property $\varphi_n$. By symmetry, each copy $P_i$ of
the representative $P_n$ has an isomorphic $H_i^*$ which satisfies
the corresponding $\varphi_i$.

Consider an instance of the parameterized system constructed
from isomorphic copies of $P_n$. Let $G$ be the global
state space of the instance. Let $i$ be a node of the instance.
By Theorem II.1, $G_i$ satisfies $\varphi_i$; hence, by the locality of
$\varphi_i$, it follows that $G$ satisfies $\varphi_i$. As this holds for every
node, $G$ satisfies the global property $\bigwedge_i \varphi_i$. By the first
part of Theorem II.1, this ‘inflationary’ consequence holds for
any universal Fair CTL property. It holds for all Fair CTL
properties if $H_n^*$ is outward-facing.


The synthesis procedures of the following sections will, in
effect, simultaneously construct both the strongest invariant $\theta^*$
and the resulting $H_n^*$.


III. EAGER SYNTHESIS 


We describe the eager method of synthesizing a representative
process $P_n$ whose interference closure $H_n^*$ satisfies the
Fair CTL formula $\varphi_n$. The atomic propositions in $\varphi_n$ are
divided into two disjoint groups: $X$, representing properties of
the external state, and $L$, representing properties of the internal
state. We use $a, b, a', b'$ to refer to valuations of variables in
$X$, and $k, l, k', l'$ to refer to valuations of variables in $L$. The
notation $X = a$ means that each variable in $X$ has the value
given to it in $a$.

Given a local property $\varphi_n$, the eager method produces a Fair
CTL formula $Z(\varphi_n)$ that is a conjunction of $\varphi_n$ with several
constraints. The constraints are expressed in CTL extended
with the modal operator $\langle c \rangle$ and its negation dual $[c]$, where
$\langle c \rangle f$ is the set of states from which there is a transition labeled
$c$ to a state satisfying $f$. It is straightforward to adjust the Fair
CTL synthesis procedure for this variant of the EX operator.

The candidate models are labeled transition systems where
transitions are labeled either by $n$ (the representative) or by $m$
(a neighbor). States are labeled with propositions from $X$ and
$L$. The constraints added to $\varphi_n$, intuitively, make the models
‘look’ similar to $H_n^*$.

A pair $(a, a')$ of valuations to $X$ is an interference pair if
$EF((X = a) \wedge \langle n \rangle (X = a'))$ holds at the initial state of a
candidate model; i.e., if there is a reachable state labeled $a$
with an $n$-successor labeled $a'$. By symmetry, the $n$-transition
producing this pair may be viewed as an $m$-transition of a
neighbor. A pair $(b, b')$ of valuations to $X$ is considered the
result of interference by $(a, a')$ viewed as a neighboring
$m$-transition if (1) the $X$-variables shared between $m$ and $n$ have
the same valuations in $b$ and $a$, and in $b'$ and $a'$, and (2) the
$X$-variables not shared between $m$ and $n$ have the same valuation
in $b$ and $b'$. The set of such pairs is denoted $t_m(a, a')$.
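
The two conditions defining the set of induced pairs translate directly into an enumeration. The following sketch assumes Boolean variables and, as a simplification not made explicit in the text, that a shared variable carries the same name in both views; `interference_results` and the variable names `tok_left` and `tok_right` are hypothetical.

```python
from itertools import product


def interference_results(a, a_next, variables, shared):
    """Enumerate the pairs (b, b') induced by the interference pair (a, a')."""
    unshared = [v for v in variables if v not in shared]
    pairs = []
    for bits in product([False, True], repeat=len(unshared)):
        frame = dict(zip(unshared, bits))  # (2) unshared values are kept
        b = dict(frame)
        b_next = dict(frame)
        for v in shared:  # (1) shared values follow a and a'
            b[v] = a[v]
            b_next[v] = a_next[v]
        pairs.append((b, b_next))
    return pairs


variables = ["tok_left", "tok_right"]
shared = {"tok_left"}
a = {"tok_left": False, "tok_right": True}
a_next = {"tok_left": True, "tok_right": False}
pairs = interference_results(a, a_next, variables, shared)
```

The number of induced pairs is exponential in the number of unshared variables, which is one source of the blowup in the constraints that range over these pairs.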

The Fair CTL formula $Z(\varphi_n)$ is the conjunction of $\varphi_n$ with
the constraints (1)-(4) given below. The added constraints are
expressible in CTL as $X$ and $L$ have finitely many valuations.


1) Every interference pair induces an interference transition
   at all matching states. I.e., for every interference
   pair $(a, a')$ and every $(b, b')$ in $t_m(a, a')$, the property
   $AG((X = b) \Rightarrow \langle m \rangle (X = b'))$ holds.

2) $m$-transitions do not modify local state. I.e.,
   $AG((L = l) \Rightarrow [m](L = l))$ for every valuation $l$ of the local
   propositions.

3) Every $m$-transition is induced by an interference pair.
   I.e., for every $b, b'$ such that
   $EF((X = b) \wedge \langle m \rangle (X = b'))$, there is an interference pair
   $(a, a')$ such that $(b, b') \in t_m(a, a')$.

4) States with the same propositional label have similar
   successors. I.e., for $c$ ranging over $m$ and $n$: if
   $EF((X = a \wedge L = l) \wedge \langle c \rangle (X = a' \wedge L = l'))$ holds, then
   $AG((X = a \wedge L = l) \Rightarrow \langle c \rangle (X = a' \wedge L = l'))$.

A specification is realizable if it has a satisfying model. 


Theorem III.1. $Z(\varphi_n)$ is realizable if and only if there is
a process $P_n$ with state space $2^X \times 2^L$ whose
interference-closure $H_n^*$ satisfies $\varphi_n$.


Proof. We show that any solution to the right-hand condition
induces a solution to $Z(\varphi_n)$, and vice-versa.

From right-to-left, consider a process $P_n$ meeting the
right-hand condition. We claim that $H_n^*$ satisfies conditions (1)-(4)
by its inductive construction. If an interference pair $(a, a')$
becomes reachable at some stage of the construction, it is used to
construct interference transitions at all subsequent stages; thus,
condition (1) holds. Interference transitions do not modify
local state, meeting condition (2). Moreover, all interference
transitions stem from an interference pair introduced at an
earlier stage, meeting condition (3). Finally, as the closure is
defined over the same state space as $P_n$, there is a unique state
for each propositional labeling, satisfying condition (4).

The proof for the left-to-right direction is more involved, as
we cannot a priori restrict the models of $Z(\varphi_n)$ to the state
space $2^X \times 2^L$. Thus, consider any model $M_0$ of $Z(\varphi_n)$. We
may assume that every transition of $M_0$ is reachable. (If not,
limiting $M_0$ to its reachable state space still satisfies $Z(\varphi_n)$.)

Let $\sim$ be the relation defined by $s \sim t$ if states $s$ and $t$
satisfy the same propositions. Condition (4) implies that $\sim$
is a strong bisimulation on $M_0$. (Proof: Consider states $s, t$
such that $s \sim t$ and a $c$-successor $s'$ of $s$. Let $a, l$ be the
propositions over $X$ and $L$ (respectively) that are satisfied by
$s$, and let $a', l'$ be the corresponding propositions satisfied by
$s'$. The transition from $s$ to $s'$ is a witness to the assumption
of (4); hence $t$ must have a $c$-successor $t'$ satisfying $a', l'$. By
definition, $s' \sim t'$ holds.)

Let $M_1$ be the quotient of $M_0$ under $\sim$. As $\sim$ is a strong
bisimulation, $M_0$ and $M_1$ are strongly bisimilar; hence, both
satisfy the same Fair CTL formulas; in particular, $M_1$ also
satisfies $Z(\varphi_n)$. Let process $P$ be the subgraph formed by
the $n$-transitions of $M_1$. We show that $M_1$ is the interference
closure of $P$.

Note that by the definition of $\sim$ and the quotient construction,
every propositional valuation is associated with at most
one state of $M_1$, so we can consider $M_1$ to be isomorphic to
a process with state space $2^X \times 2^L$.
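
The quotient step can be illustrated on a small explicit model. This is a sketch under assumptions: states are arbitrary hashable values, `label` maps each state to a frozenset of propositions, edges are triples `(source, c, target)`, and the function names are hypothetical. The first function checks the successor condition that condition (4) enforces; the second builds the quotient in which each class is named by its label.

```python
def respects_labels(states, edges, label):
    """Check that 'same label' is a strong bisimulation (cf. condition (4))."""
    succ = {}
    for s, c, t in edges:
        succ.setdefault((s, c), set()).add(label[t])
    for s in states:
        for t in states:
            if label[s] != label[t]:
                continue
            # Every labeled successor of s must be matched by t.
            for (u, c), targets in succ.items():
                if u == s and not targets <= succ.get((t, c), set()):
                    return False
    return True


def quotient(states, edges, label):
    """Collapse states with equal labels; a class is named by its label."""
    q_states = {label[s] for s in states}
    q_edges = {(label[s], c, label[t]) for s, c, t in edges}
    return q_states, q_edges


P, Q = frozenset({"p"}), frozenset({"q"})
states = {0, 1, 2}
label = {0: P, 1: P, 2: Q}
edges = {(0, "n", 2), (1, "n", 2), (2, "m", 0), (2, "m", 1)}

ok = respects_labels(states, edges, label)
q_states, q_edges = quotient(states, edges, label)
broken = respects_labels(states, edges - {(1, "n", 2)}, label)
```

In the quotient each propositional valuation names at most one state, which is the property used to identify the quotient with a process over the valuations of $X$ and $L$.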

We first show that the interference closure of $P$ is a
subgraph of $M_1$, by induction on the stages of the closure
construction. Initially, that is true as $P$ is a subgraph of
$M_1$. Suppose that this condition holds at the current stage.
Consider the transition added at the next step. If this is a
transition of $P$, it is already present in $M_1$. If the transition
is an interference transition applied at a state $s$, it must be
derived from an $n$-transition present at the current stage. By
the induction hypothesis, the inducing $n$-transition and the
state $s$ both belong to $M_1$. By conditions (1) and (2), the
derived interference transition from $s$ also belongs to $M_1$. It
follows that the closure process constructed as the limit of
these steps is a subgraph of $M_1$.

We also need to rule out the existence of transitions in $M_1$
that are not in the closure process. Let $t$ be a transition of
$M_1$, from a state $(b, k)$ to $(b', k')$. If this is an $n$-transition, it
belongs to $P$ and hence to the closure. Consider the case where
it is an $m$-transition. By (2), $k'$ must equal $k$. From (3), there
is an interference pair $(a, a')$ in $M_1$ induced by an $n$-transition
$t'$ such that $(b, b') \in t_m(a, a')$. The $n$-transition $t'$ is in $P$ by
definition and hence in the closure. Therefore the interference
transition $t$ induced by $t'$ is also in the closure.


The eager method is technically interesting as it transforms 
the new, self-referential synthesis question into a standard 
form, simply by adding constraints that encode interference. 
However, the transformation results in an exponential blowup 
as the added constraints range over all propositional valuations. 
Hence, this method is likely to be impractical. The following 
section formulates a lazy procedure that gradually introduces 
interference into a tableau of the original formula. 


IV. THE TABLEAU APPROACH 


A tableau of $n$ is a tuple $T_n = (V_n, R, L)$, where $V_n$ is
a set of nodes, $R$ is a transition relation over $V_n$, and
$L : V_n \to 2^{Prop}$ is a labeling function. A tableau has two types
of nodes, $V_n = V^C \cup V^D$ such that $V^C \cap V^D = \emptyset$, where
$V^C$ is a set of AND-nodes that are potential states of $P_n$,
and $V^D$ is a set of OR-nodes. The transition relation is
$R = R^{DC} \cup R^{CD}$, where $R^{DC} \subseteq V^D \times V^C$, $R^{CD} \subseteq V^C \times V^D$,
and transitions in $R^{CD}$ are labeled with $n$ or $m \in nbr(n)$.
Each node $v_n \in V_n$ is labeled with a subset of $Prop$, where
$Prop$ is the extended Fischer-Ladner closure of $\varphi_n$ [6], [7].
The closure $Prop$ describes the negation, subset, and fixpoint
closure of the temporal operators.

We adopt the two-pass tableau approach of [8], [4], i.e.,
first construct a tableau from the specification, then prune and
unravel the tableau into a model. The local property $\varphi_n$ of
interest is in the format of init-spec $\wedge$ other-spec. Here,
init-spec specifies a single initial state. For multiple initial
states, a set of local properties $\{\varphi_n^0, \varphi_n^1, \ldots\}$ is generated, each
with the same other-spec but a different init-spec.

We modify the classical tableau approach to synthesize $H_n^*$
from $\varphi_n$, such that $H_n^*$ is outward-facing and closed under
interference. Subsection IV-A shows how to derive the initial
tableau $T_n^0$ closely following the original procedure [8]. Our
main innovation is that we assume the neighbors are isomorphic
copies of $T_n^i$, and Subsection IV-B shows how to construct
$T_n^{i+1}$ by adding interference transitions to $T_n^i$. The iterative
procedure continues until a fixpoint tableau $T_n^*$ is reached such
that $T_n^*$ is closed under interference by isomorphic copies of
$T_n^*$. We then apply deletion rules (in Subsection IV-C), extract
a model $H_n^*$ from the pruned fixpoint tableau, and obtain $P_n$
from $H_n^*$ by removing interference transitions (in Subsection
IV-D). These steps follow the original tableau procedure with
slight variations.


A. The Initial Tableau 


Similar to the classical tableau approach [8], [4], the root of
the tableau, $d_{root}$, is an OR-node labeled with $\{\varphi_n\}$. Starting
with $d_{root}$, the initial tableau $T_n^0$ is constructed by repeatedly
creating successors and appending them to the leaf nodes. In
the case of duplicate labels and types, the newly created node
is merged with the existing node, i.e., the new node is deleted
and its incoming and outgoing edges are added to the existing
node. The construction of $T_n^0$ terminates when there are no
more leaf nodes. If there are multiple initial states, we repeat
the steps of constructing the initial tableau with different
init-specs while merging duplicates.

For each OR-node $d$, $blocks(d)$ is a set of successors of $d$
such that each AND-node $c_i \in blocks(d)$ represents a way of
satisfying the formulae in $L(d)$. The generation of $blocks(d)$
follows the classical tableau approach, with a slightly different
$\alpha$-$\beta$ expansion: as listed in Table I (cf. [6]), most expansions
are binary, except for AY and EX, which expand to a list
of operators indexed by $n$ and the neighbors in $nbr(n)$. The
unindexed next-time operators are not part of $\varphi_n$ but can be
added to node labels during formula expansion. Formulae in
$L(d)$ are satisfiable iff there exists a node in $blocks(d)$ whose
label is satisfiable.

For each AND-node $c$, $tiles(c)$ is the minimal set of
$n$-successors of $c$, i.e., the next-time states reachable through
transitions labeled with $n$. Let $CA_n = \{f \mid A_\Phi Y_n f \in L(c)\}$
and $CE_n = \{g \mid E_\Phi X_n g \in L(c)\}$. For each $g \in CE_n$, an
OR-node labeled $CA_n \cup \{g\}$ is created as a successor node
of $c$. Edges from $c$ to nodes in $tiles(c)$ are labeled with $n$.
Here, we only consider a single edge case. I.e., if both $CA_n$
and $CE_n$ are empty sets, then we add a ‘dummy’ successor
$d_n$ to $c$ and set $blocks(d_n) = \{c\}$. If $L(c)$ is satisfiable, then
the labels of all nodes in $tiles(c)$ are satisfiable.
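
The construction of $tiles(c)$ amounts to collecting the arguments of the indexed next-time formulae in a label. The sketch below uses a hypothetical formula encoding: atoms are strings, and the tuples `("AY", p, f)` and `("EX", p, g)` stand for the weak and strong next-time formulae indexed by process `p`. The case where only universal next-time formulae are present is not spelled out in the text; the single-successor treatment used here is an assumption borrowed from the classical construction.

```python
DUMMY = frozenset({"<dummy>"})  # placeholder label for the 'dummy' successor


def tiles(label, proc="n"):
    """Labels of the OR-node successors of an AND-node for process proc."""
    ca = frozenset(
        f[2] for f in label
        if isinstance(f, tuple) and f[0] == "AY" and f[1] == proc
    )
    ce = frozenset(
        f[2] for f in label
        if isinstance(f, tuple) and f[0] == "EX" and f[1] == proc
    )
    if ce:
        # One successor per existential formula, carrying all universal ones.
        return {ca | {g} for g in ce}
    if ca:
        return {ca}  # assumption: a single successor carries the universal part
    return {DUMMY}   # both sets empty: dummy successor


label = frozenset({
    "a",
    ("AY", "n", "p"),
    ("EX", "n", "q"),
    ("EX", "n", "r"),
    ("EX", "m", "z"),  # indexed by a neighbor, ignored for proc="n"
})
successors = tiles(label)
```

Calling `tiles(label, proc="m")` with the same label yields the single successor labeled `{"z"}`, since only the `m`-indexed formula is collected.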

In the classical tableau approach, for each neighbor $m$, the
set of $m$-successors of $c$ is created in a similar way to
$n$-successors. However, since the local property $\varphi_n$ only specifies
the behavior of $P_n$, interference transitions by neighboring
processes $P_m$ are not specified in $\varphi_n$. Instead, we infer the
transitions labeled with $m$ based on transitions labeled with
$n$. The next subsection shows the detailed steps of adding
interference transitions and $m$-successors.


B. The Fixpoint Tableau 


Starting from the initial tableau $T_n^0$ containing only
transitions labeled with $n$, we construct $T_n^{i+1}$ from $T_n^i$ through the
following steps.

First, we summarize the interferences contained in the
tableau so far. We search $T_n^i$ for $n$-transitions that change the
values of shared variables and convert these $n$-transitions to a
set of $m$-transitions for each neighbor $m$ by bijection. That is,
for each pair of AND-nodes $c$ and $c'$ such that $c' \in blocks(d)$
for $d \in tiles(c)$, let $Y$ and $Y'$ be the values of the shared
variables in $L(c)$ and $L(c')$, respectively. If $Y'$ is different from
$Y$, then we use one or more tuples $(m, Y_m, Y_m')$ to record
$m$-transitions that change the values of shared variables between
$n$ and $m$ from $Y_m$ to $Y_m'$.

Next, we add interference transitions to the current tableau.
For each unique tuple $(m, Y_m, Y_m')$, we add the interference
transition to each applicable AND-node and label the transition
with $m$. An AND-node $c$ in $T_n^i$ is applicable to an interference
transition $(m, Y_m, Y_m')$ if the values of the shared variables in
$Y_m$ match those in $L(c)$, and the interference transition is not
already added to $c$.
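
The two steps above, summarizing interference and testing applicability, can be sketched as follows. Everything in this example is an assumption made for illustration: labels are reduced to dictionaries over the external variables, the bijection is modeled as a per-neighbor renaming `rename[m]` from the variable that $n$ changes to the variable through which $n$ would observe the same change made by $m$, and the ring-style names `tok_n`, `tok_n1`, `left`, and `right` are hypothetical.

```python
def interference_tuples(n_edges, rename):
    """Record a tuple (m, Y_m, Y_m') for every n-transition that changes Y_m."""
    tuples = set()
    for src, dst in n_edges:
        for m, rho in rename.items():
            y = frozenset((rho[v], src[v]) for v in rho)
            y_next = frozenset((rho[v], dst[v]) for v in rho)
            if y != y_next:  # only changes to shared variables are recorded
                tuples.add((m, y, y_next))
    return tuples


def applicable(label, tup):
    """An AND-node label is applicable if it agrees with Y_m on shared values."""
    _, y, _ = tup
    return all(label.get(var) == val for var, val in y)


# n passes the token to its right neighbor.
n_edges = [({"tok_n": True, "tok_n1": False}, {"tok_n": False, "tok_n1": True})]
rename = {
    "left": {"tok_n1": "tok_n"},   # the left neighbor's outgoing edge is tok_n
    "right": {"tok_n": "tok_n1"},  # the right neighbor's incoming edge is tok_n1
}
tuples = interference_tuples(n_edges, rename)

receive = ("left", frozenset({("tok_n", False)}), frozenset({("tok_n", True)}))
```

Under these assumptions the single token-passing move of $n$ yields two tuples: the left neighbor handing the token to $n$, and the right neighbor giving up the token it holds.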

For each AND-node $c$, $bricks_m(c)$ is a possibly empty set of
$m$-successors of $c$, and $bricks(c) = \bigcup_{m \in nbr(n)} bricks_m(c)$.
An empty $bricks_m(c)$ indicates an implicit self-loop by $m$ in
$c$, i.e., transitions labeled with $m$ do not interfere with $n$ in $c$.

The set $bricks_m(c)$ is generated as follows. Let
$CA_m = \{f \mid A_\Phi Y_m f \in L(c)\}$ and $CE_m = \{g \mid E_\Phi X_m g \in L(c)\}$.
These $m$-indexed properties are not sub-formulae of $\varphi_n$ but
are added to node labels as a result of $\alpha$-$\beta$ expansion. For
example, $A_\Phi G p$ expands to $p$, $A_\Phi Y_n A_\Phi G p$, and $A_\Phi Y_m A_\Phi G p$
for each $m$. For each unique interference $(m, Y_m, Y_m')$ and
applicable AND-node $c$, we create an OR-node successor $d_m$.
The label of $d_m$ contains formulae in $Y_m'$, $CA_m$, and values
of variables in $L(c)$ that are not shared with $m$.

In addition, we also create an OR-node successor
of $c$ for each $E_\Phi X_m g \in L(c)$. These successors capture the
changes to shared variables as well as the satisfaction of
existential next-time properties. For a given $Y_m$, consider the
set of $Y_m'$ such that $(m, Y_m, Y_m')$ is a tuple. Those $Y_m'$ form
the possible interference to shared variables. The changes to
$Y_m$ are translated into a disjunctive formula. Each change is
represented as a conjunction of the values of variables in $Y_m'$. For
each $g \in CE_m$ and applicable AND-node $c$, we create an
OR-node $d_m$, and $L(d_m)$ contains $g$, the disjunctive formula,
formulae in $CA_m$, and values of variables in $L(c)$ that are not
shared with $m$. For each newly created node $d_m$, we connect
$c$ to $d_m$ by an edge labeled $m$ and merge $d_m$ if duplicated.

Figure 2 is an example of adding bricks to a given 
AND-node (n-successor nodes are omitted from the figure). 


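[Figure 2: an AND-node labeled $\{a, b, c, E_\Phi G a, E_\Phi X_n E_\Phi G a, E_\Phi X_m E_\Phi G a,
A_\Phi G c, A_\Phi Y_n A_\Phi G c, A_\Phi Y_m A_\Phi G c\}$ with three $m$-labeled edges to OR-nodes
labeled $\{\neg a, b, A_\Phi G c, c\}$, $\{a, \neg b, A_\Phi G c, c\}$, and
$\{E_\Phi G a, (\neg a \wedge b) \vee (a \wedge \neg b), A_\Phi G c, c\}$.]

Fig. 2. The interference transitions and m-successors of an AND-node.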


In this example, $a$ and $b$ are two external variables shared
between $n$ and $m$, and $c$ is an internal variable of $n$. Suppose
$m$ interferes with $n$ only by changing $(a, b)$ to $(\neg a, b)$ or
$(a, \neg b)$. The disjunctive formula representing changes of $(a, b)$
is $(\neg a \wedge b) \vee (a \wedge \neg b)$. Property $E_\Phi G a$ is propagated to exactly
one $m$-successor, and $A_\Phi G c$ is propagated to all $m$-successors.
The propagation is done through blue formulae in the figure.

Finally, for each newly added OR-node $d$, we create
descendants of $d$ that are reachable via $n$-transitions. The construction
terminates when there are no more leaf nodes. The size of
the resulting tableau $T_n^{i+1}$ is greater than or equal to the
size of $T_n^i$. We repeat these steps until no more transitions
or nodes can be added, i.e., when $T_n^{i+1} = T_n^i$. The resulting
tableau captures all the changes to values of shared variables
by neighboring processes as interference transitions.

Based on the fairness constraint, interference transitions will
eventually be executed. At each AND-node $c$ where the value
of shared variables between $n$ and $m$ is represented by $Y_m$, we
need to distinguish between two cases: (1) $m$ changes $Y_m$ to
$Y_m'$ through a (stuttering) transition such that $Y_m \neq Y_m'$, and
(2) $m$ keeps $Y_m$ unchanged in a fair cycle. The first case was
captured as interference transitions and the second as implicit
self-loops. However, if both cases happen at the same $Y_m$, the
corresponding node $c$ should have the interference transition
indicating the change as well as an explicit $m$-labeled
self-loop indicating the choice of ‘remaining unchanged forever’.
We add these self-loops to applicable AND-nodes in $T_n^*$ by
using dummy nodes.

When no more transitions can be added, the tableau has
reached its fixpoint, $T_n^*$, and the construction terminates.


C. Tableau Pruning 


The goal is to construct a model $H_n^*$ such that $P_n$ is
outward-facing in $H_n^*$. Since $H_n^*$ is extracted from the pruned
$T_n^*$, we add a restricted outward-facing assumption to only
focus on tableaux where all the encoded models are
outward-facing. For each neighbor $m$ and each set of values of
shared variables $Y_m$, the restricted outward-facing assumption
requires the representative $n$ to make the same set of changes
to the shared state $Y_m$, no matter which AND-node child is
selected to be in the model. This guarantees a strictly stronger
form of the outward-facing property.

TABLE II
THE DELETION RULES FOR TABLEAU PRUNING.

deleteP      Delete any node whose label is propositionally inconsistent.

deleteOR     Delete any OR-node all of whose successors are deleted.

deleteAND    Delete any AND-node one of whose successors is deleted.

deleteEU     Delete any node $v$ if $E_\Phi(fUg) \in L(v)$, and there does not
             exist an AND-node $c'$ reachable from $v$ through a finite
             path $\pi$, such that $g \in L(c')$ and $f \in L(c)$ for all
             AND-nodes $c$ on $\pi$ except $c'$.

deleteAU     Delete any node $v$ if $A_\Phi(fUg) \in L(v)$, and there does
             not exist a subdag $U$ rooted at $v$ such that $g \in L(c')$ for
             all leaf nodes $c'$ in $U$ and $f \in L(c)$ for all internal
             AND-nodes $c$ of $U$.

deleteEG     Delete any node $v$ if $E_\Phi G g \in L(v)$, and there does not
             exist a fair full path $\pi$ starting at $v$ such that $g \in L(c)$ for
             all nodes $c$ on $\pi$.

deleteJoint  Delete any AND-node $c_n$ if every AND-node $c_m$ of a
             neighbor $m$ that forms a joint state $(c_n, c_m)$ is deleted.



Before pruning, we verify the assumption on $T_n^*$ and
terminate the synthesis procedure if the assumption is violated.
(Relaxing the assumption and finding an outward-facing model
from any tableau is a future research direction.)

Similar to the classical approach, pruning tableau $T_n^*$ is
done by deleting inconsistent nodes. As shown in Table II
(cf. [4]), an additional rule deleteJoint is added. DeleteJoint
deletes any AND-node in $T_n^*$ that fails to form joint states
with the neighboring isomorphic tableau $T_m^*$. For each neighbor
$m$ and each value of shared variables $Y_m$, let $C_n^{Y_m}$ be a set of
AND-nodes of $n$ such that $Y_m \subseteq L(c_n)$ for each node $c_n$ in
the set. Let $C_m^{Y_m}$ be a set of AND-nodes of $m$ such that each
$c_m \in C_m^{Y_m}$ forms joint states with the nodes in $C_n^{Y_m}$. If all the
isomorphic AND-nodes $b_n$ of $c_m$ in $C_m^{Y_m}$ are deleted, we delete
all the AND-nodes $c_n$ in $C_n^{Y_m}$ because the joint states of $Y_m$
no longer hold, and vice versa.


The pruning process eventually terminates because the number
of nodes in $T_n^*$ is finite. Upon termination, if the root of the
tableau is deleted, then $\varphi_n$ is not satisfiable by our procedure.
Otherwise, we extract a model $H_n^*$ from the pruned tableau.
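
The pruning loop can be sketched as a fixpoint over the deletion rules. The example below covers only deleteOR and deleteAND, starting from a given set of inconsistent nodes (as produced by deleteP); the eventuality rules and deleteJoint are deliberately left out. The function name `prune`, the node names, and the dictionary encoding of the tableau are all hypothetical.

```python
def prune(kind, succ, inconsistent):
    """Apply deleteOR and deleteAND until no further node can be deleted."""
    deleted = set(inconsistent)
    changed = True
    while changed:
        changed = False
        for v in kind:
            if v in deleted:
                continue
            children = succ.get(v, ())
            if kind[v] == "OR":
                dead = all(c in deleted for c in children)  # deleteOR
            else:
                dead = any(c in deleted for c in children)  # deleteAND
            if dead:
                deleted.add(v)
                changed = True
    return set(kind) - deleted


kind = {"d0": "OR", "c1": "AND", "c2": "AND",
        "d1": "OR", "d2": "OR", "c3": "AND"}
succ = {"d0": ["c1", "c2"], "c1": ["d1"], "c2": ["d2"],
        "d1": ["c3"], "d2": ["c2"], "c3": []}
remaining = prune(kind, succ, inconsistent={"c3"})
root_survives = "d0" in remaining
```

Each pass only ever adds nodes to the deleted set, so the loop terminates after at most as many passes as there are nodes, matching the finiteness argument above. Checking whether the root survives corresponds to the satisfiability test applied upon termination.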


D. Extraction of a Model 


We reuse the existing procedure in [8], [6] to ‘unravel’ the
pruned tableau $T_n^*$ into a model.

For each AND-node $c$ in $T_n^*$, we construct a fragment of
$c$ following the standard tableau approach. The structure of a
fragment is taken from $T_n^*$. All nodes in a fragment are
AND-nodes. Nodes $s$ and $t$ are connected with a directed edge in
a fragment if there exist transitions $(c, d), (d, c') \in R$ in $T_n^*$,
such that $s$ and $t$ are copies of $c$ and $c'$, respectively. The
fragment of $c$ certifies the fulfillment of all eventualities in
$L(c)$. When it comes to universal eventualities like $A_\Phi(fUg)$,
if there are multiple subdags in the tableau, we choose the one
with the least number of unfair cycles.

A model $H_n^*$ is formed by connecting fragments together
following the standard tableau approach. The process $P_n$ is
obtained from $H_n^*$ by removing the interference transitions.


E. Soundness and Complexity 


Theorem IV.1. Soundness. If a labeled transition system $H_n^*$
is constructed from $\varphi_n$, then $H_n^*$ satisfies $\varphi_n$, $H_n^*$ is closed
under neighboring interference, and process $P_n$ is
outward-facing in $H_n^*$.


Proof. During tableau construction, $blocks(d)$ computes
successors of an OR-node $d$, $tiles(c)$ computes $n$-successors of
an AND-node $c$, and $bricks(c)$ computes $m$-successors of $c$
for neighbors $m$. Based on the constructions of the tableau, all
formulae in node labels are propagated correctly in the tableau
of $n$, including $T_n^0$, any intermediate $T_n^i$, and $T_n^*$ (similar to
the proofs in [6]). For example, $A_\Phi(fUg)$ in the label of a node
propagates to successor nodes as either $g$ or $f$, $A_\Phi Y_n A_\Phi(fUg)$,
and $A_\Phi Y_m A_\Phi(fUg)$. The propagation continues forever along
each path until $g$ is reached.

Since all the nodes in the pruned $T_n^*$ are consistent, all the
eventualities in the label of any AND-node in the pruned $T_n^*$
are fulfilled in a fragment rooted at the node. Since $H_n^*$ is
constructed by concatenating fragments, starting with a root
that automatically satisfies $\varphi_n$, $H_n^*$ is a model of $\varphi_n$.

The size of the tableau increases monotonically until it
reaches a fixpoint, $T_n^*$. Since the size of $T_n^*$ is bounded, the
tableau construction eventually terminates at the fixpoint. By
construction, each intermediate tableau $T_n^i$ fully reflects the
interference of neighboring isomorphic copies of $T_n^{i-1}$. The
construction continues until no more nodes can be added to
the tableau. Therefore, $T_n^*$ is closed under self-interference.

Then, we show that the model is also closed under
self-interference. Based on deleteJoint, in the pruned tableau $T_n^*$,
for each $m$-labeled transition $Y_m \to Y_m'$ and each AND-node
$c$ whose label contains $Y_m$, $c$ forms joint states with neighbors
$m$, and there exist transitions isomorphic to $Y_m \to Y_m'$ in the
pruned $T_n^*$. On the other hand, in the pruned $T_n^*$, the set of
interference transitions reflects exactly the set of transitions
labeled with $n$ that change the values of shared variables.
Based on model extraction, $H_n^*$ is closed.

Since $T_n^*$ satisfies the restricted outward-facing tableau
assumption, for all the encoded models $H_n^*$, process $P_n$ is
outward-facing in $H_n^*$.


Lemma IV.2. Let $\varphi_n$ be a local property of $n$, and $\Sigma_n^{share}$
the set of shared variables in $\Sigma_n$. The size of tableau $T_n$ is
bounded by $\exp(|\varphi_n| + \exp(|\Sigma_n^{share}|))$.


Proof. For each $n$-successor $v_n$ in $T_n$, $L(v_n) \subseteq Prop$, so the
number of formulae in $L(v_n)$ is less than or equal to $|Prop|$.
Since duplicate nodes are merged, the number of $n$-successors
in $T_n$ is bounded by $\exp(|Prop|)$.

As in Section IV-B, an extra disjunctive formula is added
to the labels of some OR-nodes to represent the interference
transitions of neighbors $m$. Considering binary variables,
the number of different values of shared variables is
$\exp(|\Sigma_n^{share}|)$. Hence, there are at most $\exp(\exp(|\Sigma_n^{share}|))$
distinct disjunctive formulae that can appear
in node labels. Therefore, the number of nodes in $T_n$ is
bounded by $\exp(|Prop|) + \exp(\exp(|\Sigma_n^{share}|))$. Since $|Prop|$
is linear in terms of $|\varphi_n|$, the number of nodes in $T_n$ is in
$O(\exp(|\varphi_n| + \exp(|\Sigma_n^{share}|)))$. In applications, $\exp(|Prop|)$
is more likely to dominate $\exp(\exp(|\Sigma_n^{share}|))$. In most cases,
the size of $T_n$ is exponential in the length of the input local
Fair CTL property $\varphi_n$.

Fig. 3. A process model of the mutual exclusion protocol.
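
As a rough numeric illustration of the bound in the proof of Lemma IV.2, the two terms can be compared for small inputs. The sketch assumes that exp denotes base-2 exponentiation (natural for binary variables, but not stated explicitly), and the sizes used are made-up sample values rather than measurements from the experiments.

```python
def node_bound(prop_size, shared_vars):
    """Evaluate exp(|Prop|) + exp(exp(|Sigma_share|)) with exp taken as 2**x."""
    label_term = 2 ** prop_size                 # distinct subsets of Prop
    interference_term = 2 ** (2 ** shared_vars)  # distinct disjunctive formulae
    return label_term, interference_term, label_term + interference_term


# Hypothetical sizes: 20 closure formulae and 2 shared Boolean variables.
label_term, interference_term, total = node_bound(20, 2)
```

With only a couple of shared variables the doubly exponential term stays small (16 in this sample) while the closure term is already above a million, which is consistent with the remark that the first term tends to dominate in applications.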


Lemma IV.3. The cost of constructing $P_n$ is in time
polynomial in the size of the tableau.


Proof. For each node $v$ in tableau $T_n$, the sum of the lengths
of the formulae in $L(v)$ is in $O(|\varphi_n|^2)$. The cost of computing
the successors of $v$ is polynomial in $|\varphi_n|$. Fixpoint construction,
tableau pruning, and unraveling all require time polynomial in
the size of the tableau. Therefore, the total cost of constructing
$P_n$ is in time $O(\exp(|\varphi_n| + \exp(|\Sigma_n^{share}|)))$.


The tableau approach constructs $P_n$ as a template for the
locally symmetric processes. To deploy the template throughout
the process network, the subscript indices on all state and
transition labels are changed accordingly.


V. APPLICATIONS 


We illustrate our approach with three ring-based protocols,
namely, mutual exclusion, leader election, and dining
philosophers. Our approach is implemented in Python with the CTL
module provided in the pyModelChecking API. We tested the
synthesis procedure on a machine with a 2.5 GHz CPU and 16 GB of
memory; the three protocols took 5.3, 297, and 261 seconds, respectively.
In each case, the procedure converged within three tableau
iterations.


A. Mutual Exclusion 


Mutual exclusion is a mechanism that prevents processes 
from accessing a shared resource simultaneously. Globally, the 
mutual exclusion property asserts that no two processes can be 
in the critical section at the same time. Locally, the property 
is achieved through token passing. 

For any $K > 2$ and a generic $n$, the external variables $tok_n$
and $tok_{n+1}$ are shared with $n-1$ and $n+1$, respectively. The
internal variable $N_n$ stands for non-critical, $T_n$ for trying, and
$C_n$ for critical. We specify $\varphi_n$ as follows.


Fig. 4. A process model of the leader election protocol. 


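• Three initial conditions: $N_n \wedge tok_n \wedge \neg tok_{n+1}$ ($n$ has the
token), $N_n \wedge \neg tok_n \wedge tok_{n+1}$ (the right neighbor has the
token), and $N_n \wedge \neg tok_n \wedge \neg tok_{n+1}$ (no token locally).

• Local mutual exclusion: $A_\Phi G(\neg tok_n \vee \neg tok_{n+1})$.

• Moves of $n$ from non-critical to trying (while keeping
the token) or remaining in non-critical (while passing
the token): $A_\Phi G((N_n \wedge \neg tok_n) \Rightarrow (E_\Phi X_n(N_n \wedge
\neg tok_n) \wedge E_\Phi X_n(T_n \wedge \neg tok_n)))$, $A_\Phi G((N_n \wedge tok_n) \Rightarrow
(E_\Phi X_n(N_n \wedge \neg tok_n \wedge tok_{n+1}) \wedge E_\Phi X_n(T_n \wedge tok_n)))$.

• Moves of $n$ from trying to critical with the token:
$A_\Phi G((T_n \wedge tok_n) \Rightarrow A_\Phi Y_n(C_n \wedge tok_n))$.

• Moves of $n$ from critical to non-critical while passing the
token: $A_\Phi G(C_n \Rightarrow A_\Phi Y_n(N_n \wedge \neg tok_n \wedge tok_{n+1}))$.

• The liveness property: $A_\Phi G(T_n \Rightarrow A_\Phi F C_n)$.

• One at a time: $A_\Phi G(N_n \vee T_n \vee C_n)$, $A_\Phi G(N_n \Rightarrow (\neg T_n \wedge
\neg C_n))$, $A_\Phi G(T_n \Rightarrow (\neg N_n \wedge \neg C_n))$, and $A_\Phi G(C_n \Rightarrow
(\neg N_n \wedge \neg T_n))$.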

Properties that ensure variables remain unchanged are omitted
from the list for the sake of clarity. By induction on the
size $K$, assuming the initial condition that exactly one process
owns a single token, if $\varphi_n$ is true for all processes in a ring,
then each process eventually gets and passes
the token, and there is exactly one token (i.e., tokens are not
generated or lost). Hence, no two processes access the critical
resource simultaneously.
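
The global argument above can be exercised informally with a small simulation. The following Python sketch is only an illustration under stated assumptions: it models the token as a per-process flag `tok[n]` ("process n holds the token") rather than the shared variable pairs of the specification, and the names `step` and `simulate`, the random scheduler, and the coin flip between moves are all hypothetical choices, not the synthesized process $P_n$.

```python
import random

def step(states, tok, n, rng):
    """Apply one local move of process n on the ring (in place)."""
    right = (n + 1) % len(states)
    if states[n] == "N":
        if rng.random() < 0.5:
            states[n] = "T"                       # N -> T, keeping any token
        elif tok[n]:
            tok[n], tok[right] = False, True      # stay in N, pass the token
    elif states[n] == "T":
        if tok[n]:
            states[n] = "C"                       # T -> C only with the token
    else:
        states[n] = "N"                           # C -> N, passing the token
        tok[n], tok[right] = False, True

def simulate(k, steps, seed=0):
    """Run a random interleaving; return the set of processes that entered C."""
    rng = random.Random(seed)
    states = ["N"] * k
    tok = [True] + [False] * (k - 1)              # exactly one token initially
    entered = set()
    for _ in range(steps):
        step(states, tok, rng.randrange(k), rng)
        critical = [i for i, s in enumerate(states) if s == "C"]
        assert sum(tok) == 1, "token generated or lost"
        assert len(critical) <= 1, "mutual exclusion violated"
        entered.update(critical)
    return entered
```

A call such as `simulate(5, 5000)` checks the two invariants after every step of one random interleaving; it is a test of one run, not a proof for all ring sizes.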

Fig. 3 is a model of $\varphi_n$. Rectangles represent local states,
where yellow corresponds to initial states. Solid arrows are
transitions by $P_n$, and dashed arrows are interference transitions.
Rectangles with red borders are inconsistent states, which arise
because $\varphi_n$ has no information about the initial conditions of
non-neighboring processes. That is, from the perspective of $n$, there
is at most one token locally, but $n$ does not know how many exist globally.
Instead of deleting the parents of these inconsistent states
according to the deletion rules, we manually refine the set of
interference transitions by taking into account the initialization
of all processes in the ring, i.e., that there is only one token.


B. Chang and Roberts Leader Election 


Suppose each process has a finite and unique competing
value (abbr. cv). The goal of the protocol is to select the
process with the largest cv to be the leader. Globally, the
correctness of the protocol is specified as a safety property, i.e.,
there is never more than one leader, and a liveness property,
i.e., eventually there will be a leader. The specification can be
written locally from the perspective of a generic $n$ [9]. The
cv of $n$ may or may not be the greatest on the network.

Initially, some but not all processes detect the absence of
the leader, i.e., $P_n$ may become a participant in the election
and send out an election message containing its cv to the
right. When $P_n$ receives an election message from its left,
$P_n$ compares the competing value in the message, denoted
by cv', with its own cv. In general, the comparison yields
three different outcomes: cv' > cv, cv' < cv, and
cv' = cv. If cv' > cv, $P_n$ forwards the message to the
right. If cv' < cv, $P_n$ sends a message of its own cv. If
cv' = cv, $P_n$ becomes the leader. A non-participant becomes
a participant after forwarding or sending an election message,
and a participant no longer sends election messages of its own
cv. A new leader sends a message to the right to terminate the
election. Upon receiving the termination message, a process
becomes a non-participant and forwards the message.
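
The message-handling rules just described can be sketched as a short simulation. This is a hedged illustration, not the model of Fig. 4: it uses unbounded FIFO queues instead of size-one buffers, omits the termination message, and the function name `chang_roberts` and the round-robin scheduling are assumptions made for the example.

```python
from collections import deque

def chang_roberts(cvs, starters):
    """Simulate the election on a unidirectional ring; return the leader index."""
    k = len(cvs)
    participant = [False] * k
    inbox = [deque() for _ in range(k)]       # messages waiting at each process
    for s in starters:                        # processes that start an election
        participant[s] = True
        inbox[(s + 1) % k].append(cvs[s])
    progress = True
    while progress:
        progress = False
        for n in range(k):
            if not inbox[n]:
                continue
            progress = True
            cv_in = inbox[n].popleft()
            right = (n + 1) % k
            if cv_in > cvs[n]:                # greater: forward
                participant[n] = True
                inbox[right].append(cv_in)
            elif cv_in < cvs[n]:              # smaller: send own cv or discard
                if not participant[n]:
                    participant[n] = True
                    inbox[right].append(cvs[n])
            else:                             # equal: own cv came back
                return n
    return None                               # no election was started
```

For instance, with competing values `[3, 7, 2, 9, 4]` and process 0 starting the election, the value 9 is the only one that survives a full round, so process 3 is elected.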

A constructed model H% is shown in Fig. 4. External 
variables $b_n$ and $b_{n+1}$ are shared with the left and right neighbors,
representing shared message buffers of size one. The internal
variable $par_n$ denotes that $P_n$ is a participant, and $l_n$ denotes
that $P_n$ is the leader. Comparisons are abstracted into Boolean
variables. When $comp_n$ is true, indicating a comparison in
progress, one of the following is true: $f_n$ (greater/forward),
$s_n$ (smaller/send), $d_n$ (smaller/discard), $e_n$ (equal), and $t_n$
(election termination). For comparison results other than $d_n$,
$b_{n+1}$ becomes true, i.e., a message is sent to the right.

The global reasoning for this protocol is as follows. Glob- 
ally, there exists one process whose competing value is the 
greatest. Based on the global initialization and local specifi- 
cation, and supposing the message comparison always yields 
correct results, the process with the greatest cv sends and
receives a message with its own competing value. For all the 
other processes, messages with their competing values will 
not go through the full round of message passing, and these 
messages will eventually be discarded by processes with a 
greater competing value. Therefore, there will eventually be a 
leader and never more than one leader. 


C. Dining Philosophers 


In a standard dining philosophers protocol [10], the internal
state of $P_n$ is one of $T_n$ (thinking), $H_n$ (hungry), or $E_n$
(eating). Fig. 1 indicates the external variables of $n$, where
$r_n$ means that $P_n$ picks up its left fork, and $l_n$ means that the
left neighbor picks up the fork. Similarly, $l_{n+1}$ means that $P_n$
picks up its right fork, and $r_{n+1}$ means that the right neighbor
picks up the fork. The variables $r_n$ and $l_n$ cannot be true at the
same time, nor can $r_{n+1}$ and $l_{n+1}$. When both variables $r$ and $l$
of a fork are false, the fork is available. Process $P_n$
can read and write $r_n$ and $l_{n+1}$, but has read-only access
to $l_n$ and $r_{n+1}$.

Process $P_n$ can stay in thinking or move to hungry at any
time, and $P_n$ in its hungry state picks up available forks.
While holding both the left and the right forks (i.e., $r_n \wedge l_{n+1}$),
$P_n$ should enter the eating state. After eating, $P_n$ goes
back to thinking and returns the forks (i.e., $\neg r_n \wedge \neg l_{n+1}$). The
specification guarantees that no two neighboring processes are
eating simultaneously.

Fig. 5. A process model of dining philosophers.

Fig. 5 shows a model of $\varphi_n$. Adding a liveness property,
$A_\Phi G(H_n \Rightarrow A_\Phi F E_n)$, would make $\varphi_n$ unsatisfiable. Livelock
and starvation are possible and are observed locally in the
model. The unsatisfiability of $\varphi_n$ extended with the liveness
property does not mean there is no local solution to the dining
philosophers problem. On the contrary, the problem can be
solved using acyclic precedence graphs as in [10] (i.e., by
modifying $\varphi_n$ and introducing more variables).


VI. RELATED WORK AND CONCLUSION 


In this paper, we reduce the synthesis problem for a 
parameterized protocol to the problem of synthesizing a 
representative process that meets a local specification under 
interference from neighboring copies of itself. The algorithm 
runs in time exponential in the length of the local property, 
which is expressed in Fair CTL and may include safety as 
well as liveness aspects, using both universal and existential 
path quantification. The approach is incomplete and not fully 
automated, but it succeeds on several interesting cases. 

The novelty is in our solution to the new ‘self-referential’ 
synthesis question. Our tableau construction builds on the 
classical one of [8] for CTL and that of [4] for Fair CTL. 
These constructions work in closed synthesis settings where 
the environment is assumed to be cooperative. A fully open 
synthesis procedure was devised for LTL in [11]. In our case, 
the environment is formed of copies of the unknown to-be- 
synthesized process, which is an open synthesis problem of a 
special type. 

The work relies on the compositional inductive invariant 
under local symmetry given in [12], [13], and [2]. We capture 
the behaviors of a representative in its neighborhood as a 
fixpoint tableau. Other work related to inductive invariants 
(cf. [14]) uses similar fixpoint characterizations to compute
thread-modular rely-guarantee assertions under abstractions. 
Synthesis of a distributed system is undecidable, even with
a fixed number of components [15]. Decidable architectures
are known [16], as are decision procedures (cf. [17], [18]),
but the complexity is exponential or even nonelementary in 
the number of processes. In contrast, our procedure produces 
a representative process that is replicated to form arbitrary-size 
instances, so its complexity is independent of the instance size. 

Reduction or generalization theorems are also central to 
prior work on parametrized synthesis. In [5] representative 
processes are constructed from synthesis of pair-systems. The 
paper [19] decides ‘almost always satisfiability’ for indexed
but restricted CTL properties. Cutoff results for parametrized 
verification are applied in [20] to synthesize ring protocols; 
however, the dining philosophers and leader election examples 
fall outside the class for which cutoffs are known. The 
paper [21] takes an automata-theoretic approach to rotation- 
symmetric architectures. Synthesis of symmetric processes in 
self-stabilizing parameterized unidirectional rings is explored 
by [22]. The paper [23] focuses on round-bounded parameter- 
ized systems. 

The different approaches that exploit symmetry in the
system structure make use of a kind of global symmetry
(cf. [24], [25], and [26]). In contrast, the work presented in
this paper relies on notions of local symmetry as introduced
in [12], [13], and [2]. The differences are important because
local symmetry properly generalizes ‘global symmetry,’ often
allowing for exponentially more reduction, for instance in
the case of ring architectures. Our work here is the first to
show how the notion of local symmetry can be used to
form the basis of a synthesis procedure whose output is a
single representative that can be deployed across all network
instances in the parametric family of networks.

The reduction theorem on which the work in this paper is 
based is of an assume-guarantee type. Existing formulations 
of assume-guarantee synthesis (cf. [27], [28]), however, do not
allow for the self-referential form of synthesis that is required 
by the reduction theorem. 

We are currently working on applications to other protocols, 
including those with several representative processes, to fault 
tolerant protocols [6], and towards relaxing the outward-facing 
assumption. 
Abstract—The focus of this paper is on the synthesis of
unidirectional symmetric ring protocols that are self-stabilizing. 
Such protocols have an unbounded number of processes and 
unbounded variable domains, yet they ensure recovery to a set 
of legitimate states from any state. This is a significant problem 
as many distributed systems should preserve their fault tolerance 
properties when they scale. While previous work addresses 
this problem for constant-space protocols where the domain size of
variables is fixed regardless of the ring size, this work tackles
the synthesis problem assuming that both variable domains and 
the number of processes in the ring are unbounded (but finite). 
We present a sufficient condition for synthesis and develop 
a sound algorithm that takes a conjunctive state predicate 
representing legitimate states, and generates the parameterized 
actions of a protocol that is self-stabilizing to legitimate states. 
We characterize the unbounded nature of protocols as semilinear 
sets, and show that such characterization simplifies synthesis. 
The proposed method addresses a longstanding problem because 
recovery is required from any state in an unbounded state space. 
For the first time, we synthesize some self-stabilizing unbounded 
protocols, including a near agreement and a parity protocol. 

Index Terms—Parameterized Systems, Synthesis and Verifica- 
tion, Self-Stabilization 


I. INTRODUCTION 


This paper investigates the problem of synthesizing Self- 
Stabilizing unidirectional Symmetric ring protocols with Un- 
bounded number of processes and unbounded variable do- 
mains, called SS-SymU protocols (a.k.a. unbounded uni- 
rings). A process contains a set of atomic actions. Once an action
of a process is executed, it is disabled until enabled again
by the neighboring processes; i.e., actions are self-disabling. In a
symmetric ring, the actions of each process are generated from
a template process by a simple variable re-indexing. A self-stabilizing
protocol automatically recovers (in a finite number
of steps) to a set of legitimate states $\mathcal{I}$ from any arbitrary state
[1]; i.e., all states are initial states. Such recovery should be
achieved without the intervention of a central authority. The 
significance of this synthesis problem is multi-fold. First, while 
uni-ring is a simple topology, it is of practical importance 
in distributed systems where the underlying communication 
topology may include cyclic structures. Second, the unbound- 
edness of the ring size and variable domains is a requirement 
where networks scale up and buffer sizes grow. The elegance 
of many distributed protocols/algorithms (e.g., logical clocks 
[2], Dijkstra’s token passing [1], unbounded registers [3]) 
is due to the assumption of unbounded variable domains
and processes, which makes it significant to develop tools 
that can synthesize such protocols under the unboundedness 
assumption. Third, self-stabilization is an important fault tol- 
erance property that enables decentralized recovery in the 
presence of transient faults, which perturb the system state 
without causing permanent damage. While previous work
[4], [5], [6], [7], [8] addresses the verification and synthesis 
of parameterized symmetric uni-rings, the domain size of 
variables remains constant regardless of the ring size. To the 
best of our knowledge, this paper presents the first method 
for the synthesis of SS-SymU protocols that are unbounded in 
terms of both the number of processes and variable domains. 

Most existing methods for the synthesis of self-stabilizing 
protocols either focus on fixed-size protocols or consider an 
unbounded number of processes only; variable domains are 
considered bounded. For example, specification-based meth- 
ods [9] compose a pair of template processes to reason about 
the global safety and local liveness properties of parameterized 
synchronization skeletons. Methods for fixed-size synthesis 
[10], [11], [12], [13] consider a fixed upper bound $k$ on the
number of processes, and generate a solution that is correct 
up to k processes. To enable the synthesis of parameterized 
self-stabilizing systems where solutions work for an arbitrary 
number of n processes, some approaches rely on parameter- 
ized synthesis [14] where an implementation is generated for a 
parameterized specification and a parameterized architecture. 
Such methods employ bounded [15] and SMT-based [11] syn- 
thesis to show the correctness of a solution with cutoff number 
of processes, where a solution exists for a protocol with cutoff 
number of processes iff (if and only if) a solution exists 
for the parameterized protocol with unbounded number of 
processes. Other methods [7] present cutoffs for the synthesis
of self-stabilizing protocols in symmetric networks; however,
such cutoffs can be quadratic or exponential in the bounded
variable domains depending on the structure of $\mathcal{I}$. Synthesis
of parameterized systems with threshold guards [4] starts with
a sketch automaton (whose transitions have incomplete guard
conditions capturing the number of received messages), and
completes the guards towards satisfying program specifications.
Our previous work [5] addresses the synthesis of self-
stabilizing parameterized protocols where the local state space 
of the template process remains constant. 
Contributions. In contrast to most existing methods, we
propose a novel approach based on the synthesis of semilinear
sets in the unbounded local state space of the template process
of SS-SymU for conjunctive predicates. Specifically, we start
with a global state predicate $\mathcal{I} = \forall i \in \mathbb{N} :: L(x_{i-1}, x_i)$, where
$L(x_{i-1}, x_i)$ denotes a local state predicate of the template
process $P_i$ and $x_i$ is an abstraction of the local state of $P_i$. We
then generate a protocol that self-stabilizes to $\mathcal{I}$ regardless of
network size and the domain size of variables. Domain size is
of particular importance as some protocols may not exist for
specific domain sizes (e.g., Dijkstra’s token ring [1] requires
a domain size of at least $N - 1$ in a ring of $N$ processes).
We utilize necessary and sufficient conditions identified in [5],
[6] for the livelock-freedom of a solution with constant-space
processes in order to impose a structure on the unbounded
transition system of the template process. Such conditions
require the existence of a value $y$ in the domain of $x_i$ for which
$L(y, y)$ holds. Moreover, necessary and sufficient conditions
for livelock-freedom (under an unfair scheduler) require a
tree-like structure rooted at $y$ for the local state transition system
of the template process. While these results are for
constant-space processes, we generalize them for unbounded domain
sizes. Specifically, we show that if the state transition system
of the template process is a semilinear set represented as an
infinite tree rooted at $y$, then a solution exists. A semilinear
set is a finite union of linear sets, where a linear
set contains periodic integer vectors. Based on this sufficient
condition, we develop a sound algorithm that takes $L(x_{i-1}, x_i)$
and generates the periodic linear sets of a semilinear set in a
way that their vectors are organized in a potentially infinite
tree rooted at $y$. Each synthesized linear set represents the
unbounded structure of a protocol action. We then use such
linear sets to synthesize the parameterized actions of a protocol
that self-stabilizes to $\mathcal{I}$ for an unbounded number of processes
and unbounded domain sizes. We demonstrate the proposed
method using a near-agreement and a parity protocol.
Organization. Section II provides some basic concepts. Sec- 
tion III presents the proposed synthesis method. Section IV 
demonstrates the application of the synthesis method for a 
parity protocol. Section V discusses related work. Section VI 
makes concluding remarks and discusses future research. 


II. PRELIMINARIES 


This section presents the definitions of state predicates,
parameterized protocols and their representation as locality
graphs (adopted from [16], [17], [5], [6]), and semilinear
sets. We use the term parameterized protocol to refer to
uni-ring symmetric protocols that have both an unbounded number
of processes and unbounded variable domains. A protocol $p$
includes $N > 1$ symmetric processes on a uni-ring, where the
code of each process is derived from the code of a template
process $P_i$ by variable re-indexing. The template process $P_i$
has a variable $x_i$ whose domain abstracts the set of valuations
to all writable variables of $P_i$. The domain of $x_i$, denoted
$M = Dom(x_i)$, can be unbounded (but finite). Any local state
of a process (a.k.a. locality/neighborhood) is determined by a
unique valuation of its readable variables. We assume that any
writable variable is also readable. Network topology defines
the set of readable variables of a process. For example, in a
uni-ring consisting of $N$ processes, each process $P_i$ (where
$i \in \mathbb{Z}_N$, i.e., $0 \le i \le N - 1$) has a predecessor $P_{i-1}$, where
subtraction is modulo $N$. That is, $P_i$ can read the values
of $x_i$ and $x_{i-1}$, but can update only $x_i$. The global state of
a protocol is defined by a snapshot of the local states of all
processes. The state space of a protocol $p$, denoted by $\Sigma_p$, is
the universal set of all global states of $p$. A state predicate
is a subset of $\Sigma_p$. A process acts (i.e., transitions) when it
atomically updates its state based on its locality.


We assume that processes act one at a time (i.e., interleaving
semantics). Thus, each global transition corresponds to
the action of a single process from some global state. An
execution/computation of a protocol is a sequence of states
$s_0, s_1, \ldots, s_k$ where there is a transition from $s_i$ to $s_{i+1}$ for
every $i \in \mathbb{Z}_k$. The transition function $\delta : \Sigma_p \times \Sigma_p \to \Sigma_p$ of the
template process captures its set of actions $x_{i-1} = a \wedge x_i =
b \rightarrow x_i := c$, which can also be captured as triples of the
form $(a, b, c)$. That is, $\delta(a, b) = c$ iff (if and only if) $P_i$ has
an action $x_{i-1} = a \wedge x_i = b \rightarrow x_i := c$. An action has two
components: a guard, which is a Boolean expression in terms
of readable variables, and a statement that atomically updates
the state (i.e., writable variables) of the process once the guard
holds; i.e., the action is enabled. Previous work [18] shows that
assuming self-disabling and deterministic processes simplifies
synthesis without undermining soundness and completeness.
An action $(a, b, c)$ cannot co-exist with action $(a, c, d)$ in a
self-disabling process for any $d$. A deterministic process cannot
have two actions enabled at the same time; i.e., an action
$(a, b, c)$ cannot co-exist with an action $(a, b, d)$ where $d \neq c$.
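
The two restrictions on action triples can be checked mechanically. The sketch below is a minimal illustration that represents a template process as a collection of triples $(a, b, c)$; the function names `is_deterministic` and `is_self_disabling` are hypothetical and not taken from the paper.

```python
def is_deterministic(actions):
    """True iff no two actions (a, b, c) and (a, b, d) have c != d."""
    chosen = {}
    for a, b, c in actions:
        # Remember the first result for guard (a, b); a different one conflicts.
        if chosen.setdefault((a, b), c) != c:
            return False
    return True

def is_self_disabling(actions):
    """True iff no action (a, b, c) leaves the guard (a, c) of another enabled."""
    guards = {(a, b) for a, b, _ in actions}
    return all((a, c) not in guards for a, _, c in actions)
```

For example, the pair `(0, 1, 2)` and `(0, 2, 0)` is deterministic but not self-disabling, because executing the first action under predecessor value 0 immediately enables the second.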


Definition II.1 (Action Graph). For a fixed domain size $M$,
we can depict the set of actions of the template process of
a symmetric uni-ring by a labeled directed multigraph $G =
(V, A)$, called the action graph, where each vertex $v \in V$
represents a value in $\mathbb{Z}_M$, and each arc $(a, c) \in A$ with a
label $b$ captures an action $x_{i-1} = a \wedge x_i = b \rightarrow x_i := c$.


For example, consider the Parity protocol introduced in [6].
Each process $P_i$ has a variable $x_i \in \mathbb{Z}_3$ (i.e., $M = 3$) and
actions $x_{i-1} = 0 \wedge x_i = 1 \rightarrow x_i := 0$, $x_{i-1} = 1 \wedge x_i = 2 \rightarrow
x_i := 0$, and $x_{i-1} = 2 \wedge x_i = 1 \rightarrow x_i := 0$. This protocol
ensures that, from any global state of a symmetric uni-ring, a
state is reached where processes agree on a common odd/even
parity. We formally specify these states as the state predicate
$\mathcal{I}_{par} = \forall i \in \mathbb{Z}_N : (|x_{i-1} - x_i| \bmod 2 = 0)$. Throughout
this paper, the subscript operations are modulo the number of
processes, and the arithmetic operations in the state predicates,
and in the guard and assignment of actions, are performed
modulo $M$. Figure 7b illustrates this protocol as an action
graph containing arcs $(0, 1, 0)$, $(1, 2, 0)$, and $(2, 1, 0)$.
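
As a hedged illustration, the three Parity actions can be checked exhaustively on small rings. The Python sketch below enumerates every global state for a given ring size and tests closure, absence of deadlocks outside the legitimate states, and absence of cycles. The names `ACTIONS`, `legitimate`, `successors`, and `self_stabilizes` are assumptions of this example, and the check covers only the ring sizes it is run on, not the parameterized claim.

```python
from itertools import product

M = 3
ACTIONS = {(0, 1): 0, (1, 2): 0, (2, 1): 0}   # (x_{i-1}, x_i) -> new x_i

def legitimate(state):
    """All neighbors agree on parity; state[-1] precedes state[0] on the ring."""
    return all(abs(state[i - 1] - state[i]) % 2 == 0 for i in range(len(state)))

def successors(state):
    """Global states reachable by one enabled action of a single process."""
    for i in range(len(state)):
        key = (state[i - 1], state[i])
        if key in ACTIONS:
            yield state[:i] + (ACTIONS[key],) + state[i + 1:]

def self_stabilizes(n):
    """Exhaustive check for ring size n (any cycle is rejected, which is strict)."""
    memo, on_stack = {}, set()

    def good(state):
        if state in memo:
            return memo[state]
        if state in on_stack:
            return False                       # cycle found
        on_stack.add(state)
        succ = list(successors(state))
        if legitimate(state):
            ok = all(legitimate(t) for t in succ)   # closure
        else:
            ok = bool(succ)                         # no deadlock outside I
        ok = ok and all(good(t) for t in succ)
        on_stack.discard(state)
        memo[state] = ok
        return ok

    return all(good(s) for s in product(range(M), repeat=n))
```

Calling `self_stabilizes(n)` for a few small values of `n` gives a quick regression check when experimenting with alternative action sets.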


Definition II.2 (Self-Stabilization and Convergence). A protocol
$p$ is self-stabilizing [1] to a state predicate $\mathcal{I}$ iff from
any state in $\neg\mathcal{I}$, every computation of $p$ reaches a state in
$\mathcal{I}$ (i.e., convergence) and remains in $\mathcal{I}$ (i.e., closure). A state
predicate $\mathcal{I}$ is closed in $p$ iff there is no transition $(s, s')$,
where $s \in \mathcal{I}$ and $s' \notin \mathcal{I}$. Convergence of $p$ to $\mathcal{I}$ requires that
$p$ does not reach a deadlock, nor does it reach a livelock in
$\neg\mathcal{I}$. A deadlock state is a global state where no process has
any enabled action. A livelock is an infinite cyclic computation
$l = \langle s_0, s_1, \ldots, s_0 \rangle$, where $s_i$ is a global state, for $i \ge 0$.


Definition II.3 (Locality Graph). Consider a global state
predicate $\mathcal{I} = \forall i \in \mathbb{Z}_N : L(x_{i-1}, x_i)$ for a protocol, and a
domain size $M$. The local predicate $L(x_{i-1}, x_i)$ captures a set
of local states, representing an acceptable relation between the
states of each process $P_i$ and the states of its predecessor $P_{i-1}$.
We represent $L(x_{i-1}, x_i)$ as a digraph $G = (V, A)$, called the
locality graph, such that each vertex $v \in V$ represents a value
in $\mathbb{Z}_M$, and an arc $(a, b)$ is in $A$ iff $L(a, b)$ holds.


Figure 7a illustrates the locality graph of the Parity protocol
introduced in this section for $M = 3$ and the state predicate
$L(x_{i-1}, x_i) = ((|x_{i-1} - x_i| \bmod 2) = 0)$. We have extensively
studied [5], [6] the use of locality and action graphs in
local reasoning about global properties (e.g., livelocks). Our
previous work [17], [5] investigates the following synthesis
problem, whereas in Section III we solve this problem when
its assumption is lifted.


Problem II.4 (Synthesis of Symmetric Uni-Rings).


• Input: $L(x_{i-1}, x_i)$, and the domain size $M$ of $x_i$.

• Output: The transition function $\delta$ (represented as an
action graph or parameterized actions) of a protocol $p$
such that the entire ring is self-stabilizing to $\mathcal{I} = \forall i : i \in
\mathbb{Z}_N : L(x_{i-1}, x_i)$ for any ring size $N > 3$.

• Assumption: $M$ is fixed regardless of the ring size $N$;
i.e., $p$ has constant-space processes.


The following theorem (proved in [17], [5]) provides the 
foundation of a synthesis method for parameterized uni-rings 
with constant-space processes. In the rest of this section, we 
present an overview of the synthesis method of [5] since its 
knowledge is required for our exposition. 


Theorem II.5. There is a symmetric uni-ring protocol $p$ (with
deterministic, self-disabling and constant-space processes)
that self-stabilizes to $\mathcal{I} = \forall i \in \mathbb{Z}_N : L(x_{i-1}, x_i)$ for an
unbounded (but finite) number $N$ of processes iff there is a
vertex $y$ in the locality graph $G$ of $L(x_{i-1}, x_i)$, where $L(y, y)$
holds, and the action graph of $p$ is a directed spanning tree
of $G$, sinking at $y$ as its root [17], [5].


Algorithm 1 (introduced in [5]) takes as input the local
predicate $L(x_{i-1}, x_i)$ and generates the set of parameterized
actions of a self-stabilizing uni-ring protocol. For example,
Step 1 takes the local predicate $(|x_{i-1} - x_i| \bmod 2 = 0)$ of
$\mathcal{I}_{par}$ in Parity with domain size 3, and initially generates its
locality graph illustrated in Figure 7a. This occurs because
there is some $y$ for which $L(y, y)$ holds. Selecting $y$ as 0,
Algorithm 2 generates the spanning tree of Figure 7b in Step
3 (excluding the labels). Notice that the output of Algorithm
2 is a spanning tree over the vertices of the locality graph
of $L(x_{i-1}, x_i)$ rooted at $y$, including a self-loop on $y$. Step
4 of Algorithm 1 then includes the arc labels, where a value
$b$ becomes a label for an arc $(a, c)$ iff $\neg L(a, b) \wedge (b \neq c)$.
For example, when labeling the arc $(0, 0)$ in Figure 7b,
$a = 0$, and the algorithm looks for any value $b$ in $\mathbb{Z}_3$ such
that $(|0 - b| \bmod 2) \neq 0$ modulo 3. For $M = 3$, the value
$b = 1$ is the only acceptable label.


Algorithm 1. SynUniRing($L(x_{i-1}, x_i)$: state predicate, $M$:
domain size)
1: Check if a value $y \in \mathbb{Z}_M$ exists such that $L(y, y) =$
true.
2: If no such $y$ exists, then return $\emptyset$ and declare that no
solution exists.
3: $\mathcal{T}$ := ConstructSpanningTree($L(x_{i-1}, x_i)$, $M$, $y$).
4: Transform $\mathcal{T}$ into an action graph of a protocol by the
following step:
For each arc $(a, c)$ in $\mathcal{T}$, where $a, c \in \mathbb{Z}_M$,
label $(a, c)$ with every value $b \in \mathbb{Z}_M$ for
which $L(a, b) =$ false and $b \neq c$ hold.
5: Return the actions represented by the arcs of $\mathcal{T}$.
end


Algorithm 2. ConstructSpanningTree($L(x_{i-1}, x_i)$: state predicate,
$M$: positive integer, $y \in \mathbb{Z}_M$)
1: Construct the locality graph $G = (V, A)$ of $L(x_{i-1}, x_i)$
for domain size $M$.
2: Induce a subgraph $G' = (V', A')$ that contains all arcs
of $G$ that participate in cycles involving $y$.
3: Construct a spanning tree $\mathcal{T}$ rooted at $y$ for $G'$. Use
backward reachability to construct the spanning tree.
4: For each node $v \in G$ that is absent from $G'$, include an
arc from $v$ to the root of $\mathcal{T}$. The resulting graph would
still be a tree, denoted $\mathcal{T}'$.
5: Include a self-loop $(y, y)$ at the root of $\mathcal{T}'$.
6: Return $\mathcal{T}'$.
end
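
Algorithms 1 and 2 can be sketched in Python for a fixed domain size. The code below is an illustrative reading of the pseudocode, not the authors' implementation: the local predicate is passed as a hypothetical callable `L(a, b)`, the subgraph of Step 2 is approximated by the vertices that both reach and are reached from $y$, the smallest admissible $y$ is chosen as the root, and `None` stands in for the empty result of Step 2 of Algorithm 1.

```python
from collections import deque

def construct_spanning_tree(L, M, y):
    """Algorithm 2 sketch: arcs (v, parent) of a tree sinking at y, plus (y, y)."""
    arcs = {(a, b) for a in range(M) for b in range(M) if L(a, b)}

    def reach(start, edges):
        seen, todo = {start}, deque([start])
        while todo:
            u = todo.popleft()
            for p, q in edges:
                if p == u and q not in seen:
                    seen.add(q)
                    todo.append(q)
        return seen

    # Vertices lying on closed walks through y (assumed reading of Step 2).
    core = reach(y, arcs) & reach(y, {(b, a) for a, b in arcs})
    tree, seen, todo = set(), {y}, deque([y])
    while todo:                                   # Step 3: backward BFS from y
        u = todo.popleft()
        for v in sorted(core - seen):
            if (v, u) in arcs:
                seen.add(v)
                tree.add((v, u))
                todo.append(v)
    tree |= {(v, y) for v in range(M) if v not in core}   # Step 4
    tree.add((y, y))                                      # Step 5
    return tree

def syn_uni_ring(L, M):
    """Algorithm 1 sketch: action triples (a, b, c), or None if no y exists."""
    roots = [y for y in range(M) if L(y, y)]
    if not roots:
        return None
    tree = construct_spanning_tree(L, M, roots[0])
    return {(a, b, c) for a, c in tree
            for b in range(M) if not L(a, b) and b != c}
```

With the Parity predicate and $M = 3$, `syn_uni_ring(lambda a, b: abs(a - b) % 2 == 0, 3)` yields the three triples listed for the Parity action graph.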


Theorem II.5 explains why Algorithm 2 includes a self-loop
at the root $y$ (in Step 5). Moreover, the reason why Algorithm
1 constructs a spanning tree is to ensure deadlock-freedom and
livelock-freedom. We have shown [5] that the existence of such a
spanning tree is necessary and sufficient for convergence to
$\mathcal{I}$ in symmetric uni-rings with constant-space processes.


Definition II.6 (Vector). A vector of dimension $d \ge 1$ of
non-negative integers is a tuple $(a_1, a_2, \ldots, a_d) \in \mathbb{N}^d$, where
$a_i \in \mathbb{N}$ for $1 \le i \le d$, and $\mathbb{N}$ denotes the set of non-negative
integers.


Definition II.7 (Linear Set). Any non-empty subset of $\mathbb{N}^d$ is
linear [19] if it can be represented as a periodic set of vectors
$\mathcal{L} = \{v_b + \sum_{i=1}^{n} \lambda_i p_i : \lambda_i \in \mathbb{N}\}$, where $v_b \in \mathbb{N}^d$ is the base vector
and $\{p_1, \ldots, p_n\} \subset \mathbb{N}^d$ is a finite set of period vectors.


For example, a singleton set $\mathcal{L}_1 = \{(5, 7)\}$ is linear (with
dimension $d = 2$) because the base vector is $(5, 7)$, and there
is a unique period vector $(0, 0)$. Moreover, the linear set $\mathcal{L}_2 =
\{(3, 2), (4, 3), (5, 4), \ldots\}$ has a base vector $(3, 2)$ and a period
vector $p_1 = (1, 1)$. That is, $\mathcal{L}_2 = \{v_b + \lambda p_1 : \lambda \in \mathbb{N}\}$, where
$v_b = (3, 2)$, $n = 1$, $d = 2$, $p_1 = (1, 1)$, and $\lambda \in \mathbb{N}$.
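
To make the definition concrete, the following Python sketch enumerates a finite prefix of a linear set from its base and period vectors. The helper name `linear_set` and the coefficient cap `bound` are assumptions introduced for illustration, since the set itself may be infinite.

```python
from itertools import product

def linear_set(base, periods, bound):
    """Vectors base + sum(l_i * p_i) with every coefficient 0 <= l_i <= bound."""
    out = set()
    for coeffs in product(range(bound + 1), repeat=len(periods)):
        vec = tuple(
            b + sum(l * p[j] for l, p in zip(coeffs, periods))
            for j, b in enumerate(base)
        )
        out.add(vec)
    return out
```

Here `linear_set((3, 2), [(1, 1)], 2)` produces the first three vectors of the second example, and a zero period vector collapses the result to the singleton containing only the base vector.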


Definition II.8 (Semilinear Set). A semilinear set [19] is a
finite union of some linear sets. Semilinear sets provide a finite
representation for finite and infinite subsets of $\mathbb{N}^d$.


Ginsburg and Spanier [20] show that semilinear sets capture 
the sets of integers that are definable in the first-order theory 
of integers with addition and order; i.e., Presburger arithmetic. 
Semilinear sets are closed under Boolean operations [20]. 


III. SYNTHESIS METHOD


This section first presents a sufficient condition for the
existence of an SS-SymU protocol, and then provides a sound
algorithm for generating such protocols. We use the Near 
Agreement (NA) protocol as a running example to ease the 
presentation of this section. 

Problem Statement. We solve Problem II.4 without its as- 
sumption of constant-space processes; i.e., processes have 
unbounded state spaces due to unbounded variable domains. 

Example: Near Agreement (NA) Protocol. A node $P_i$ in a
ring of $N$ symmetric nodes nearly agrees with $P_{i-1}$ iff
$(x_{i-1} = x_i) \vee (x_{i-1} = x_i + 1)$, where index subtraction is
modulo $N$ and addition is modulo $M$. Thus, the entire
ring should self-stabilize to $\mathcal{I}_{NA} = \forall i \in \mathbb{N} :: L(x_{i-1}, x_i)$,
where $L(x_{i-1}, x_i) = (x_{i-1} = x_i) \vee (x_{i-1} = x_i + 1)$. Figure
3a illustrates the locality graph of $L(x_{i-1}, x_i)$ for $M = 3$. Our
objective is to synthesize an NA protocol that is self-stabilizing
regardless of the number of processes and the domain size $M$.


A. Sufficient Condition for Solvability 


Since Algorithm 1 is a sound and complete algorithm for
any fixed domain size $M$, one can enumeratively increase
the domain size and utilize Algorithm 1 to generate a
self-stabilizing protocol for each particular $M$. However, such an
approach would not bear fruit for unbounded domain sizes
unless we can ensure that the structure of the spanning tree
(and in turn the action graph) that Algorithm 1 generates for
$M$ will be inductively preserved for $M + 1$ and beyond. This is
a challenge because when the domain size increases to $M + 1$,
the locality graph of $L(x_{i-1}, x_i)$ may be totally different. For
example, observe how the locality graphs in Figures 2a and 3a
change when $M$ is increased from 2 to 3 for the NA protocol.
To ensure that the spanning tree’s structure is preserved
when the domain size increases, one approach is to keep the arcs
of the spanning tree $T_M$ for domain size $M$, and systematically
include one more arc $(a, a')$ in $T_M$ to derive another spanning
tree $T_{M+1}$ for the domain size $M + 1$. In turn, expanding
the domain of $x_i$ from $M + 1$ to $M + 2$ should ensure that
$T_{M+2}$ preserves all arcs of $T_{M+1}$ and includes an additional arc
$(b, b')$ through some function $f$ such that $f[(a, a')] = (b, b')$
and $b = M + 1$ modulo $M + 2$. Moreover, if $f[(b, b')] = (c, c')$
when the domain size increases to $M + 3$, then $c - b = b - a$
and $c' - b' = b' - a'$ must hold. That is, the growth of the
spanning tree must be periodic. Moreover, the root remains
$y$. If such conditions are met, then for any domain size $M$,
the conditions of Theorem II.5 hold. Since the vertices of the
spanning tree are non-negative integers, each arc $(a, b)$ in a
tree is an integer vector. As such, the vector $(a, a')$ would be
the base vector of a linear set and $(b - a, b' - a')$ gives the
period vector of that linear set. Each one of the arcs in the first
tree $T_M$ for the initial domain size $M$ would also form a finite
linear set. Therefore, the arcs of the unbounded spanning tree
would form a semilinear set.


Theorem III.1. Let $T = \forall i \in \mathbb{N} :: L(x_{i-1}, x_i)$, and let
there be a value $\gamma$ for which $L(\gamma, \gamma)$ holds starting from some
domain size $M$ onward. If the arcs of the $\gamma$-rooted spanning
trees built for each domain size $k > M$ represent the periodic
growth of a semilinear set, then there is a symmetric uni-ring
protocol that self-stabilizes to $T$ regardless of the ring size and
domain size. (Proof is due to Algorithm 3 and its soundness.)


B. Overview of the Synthesis Method 


An implication of Theorem III.1 is that we no longer have
a finite spanning tree. Instead, we have an unbounded set of
spanning trees $T_0, T_1, \cdots$ as the domain size $M$ grows. Put
another way, for an unbounded domain size, we have an
unbounded spanning tree that has an unbounded branching
factor, or an unbounded depth (or both). How do we formally
represent such unbounded structures to facilitate the synthesis
of actions? Theorem III.1 points us to semilinear sets. For
example, Algorithm 1 generates the tree in Figure 2b for
the NA protocol and domain size 2, whose arcs represent the
set of integer vectors $\{(1,1), (0,1)\}$. Likewise, the trees in
Figures 3b to 5b respectively capture these three sets of integer
vectors: $\{(1,1), (0,1), (2,1)\}$, $\{(1,1), (0,1), (2,1), (3,2)\}$
and $\{(1,1), (0,1), (2,1), (3,2), (4,3)\}$ for domain sizes 3 to
5. The vectors $(1,1)$ and $(0,1)$ exist in the intersection of
all four sets and will be there for larger domain sizes too.
We call this set of vectors the common core, denoted $C$.
The remaining vectors can be generalized as the linear set
$UC = \{(2,1), (3,2), \cdots\}$ with the base vector $(2,1)$ and the
period vector $(1,1)$. We call the linear set $UC$ the unbounded
core of the protocol. Since the common core is finite, each
vector in it can be represented as a linear set. Thus, we first
generate the linear sets of a semilinear set that represents the
unbounded spanning tree of a protocol (Figure 1). Then, we
synthesize the parameterized action from linear sets.
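
The split into a common core and an unbounded core can be illustrated on the four arc sets quoted above. The following Python sketch is only an illustration under stated assumptions: the helper names (`common_core`, `unbounded_core`, `in_linear_set`) are hypothetical and not part of the paper, and fitting a single period from consecutive leftover arcs is a simplification that happens to suffice for the NA data.

```python
# Arc sets of the NA spanning trees for domain sizes 2..5, as quoted in the text.
TREES = {
    2: {(1, 1), (0, 1)},
    3: {(1, 1), (0, 1), (2, 1)},
    4: {(1, 1), (0, 1), (2, 1), (3, 2)},
    5: {(1, 1), (0, 1), (2, 1), (3, 2), (4, 3)},
}


def common_core(trees):
    """Vectors that occur in the tree of every listed domain size."""
    return set.intersection(*trees.values())


def unbounded_core(trees):
    """Fit one linear set (base, period) to the arcs outside the common core.

    Returns None when the leftover arcs do not grow by one constant period
    (hypothetical simplification: a single linear set is assumed).
    """
    rest = sorted(set.union(*trees.values()) - common_core(trees))
    if len(rest) < 2:
        return None
    diffs = {(v[0] - u[0], v[1] - u[1]) for u, v in zip(rest, rest[1:])}
    if len(diffs) != 1:
        return None
    return rest[0], diffs.pop()


def in_linear_set(vec, base, period, max_lambda=1000):
    """Bounded membership test: vec = base + k * period for some 0 <= k <= max_lambda."""
    return any(
        (base[0] + k * period[0], base[1] + k * period[1]) == vec
        for k in range(max_lambda + 1)
    )
```

On this data the sketch recovers the common core {(1,1), (0,1)} and the base/period pair ((2,1), (1,1)) stated in the text; the membership helper can then be used to check arcs expected for larger domain sizes.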


C. Generating Linear Sets 


This section presents an algorithm for the generation of a
semilinear set representing the unbounded spanning tree of a
protocol. This problem is divided into the formal specifications
of the common and unbounded cores of a protocol as
linear sets. A tree is acceptable as long as it has a vertex
corresponding to each value in a domain of size $M$ and its root
is a value $\gamma \in \mathbb{Z}_M$ for which $L(\gamma, \gamma)$ holds. Algorithm 3
generates the linear sets of an unbounded tree as Presburger
formulas. Naturally, we start with the domain size of 2. Steps 2
and 3 of Algorithm 3 search for a value $\gamma$ for which $L(\gamma, \gamma)$
holds for two consecutive odd and even domain sizes. This
search continues up to a preset upper bound $B$. Without such
an upper bound, the algorithm may never terminate. Step 4
invokes Algorithm 2 for the construction of a spanning tree
for $M$ and the $\gamma$ found in Step 3. The common core $C$ (see
Step 5) then includes the integer vectors corresponding to the
arcs of the spanning tree $T$ built in Step 4. After forming the
common core, Algorithm 3 increases the domain size in Step
6. Such an increase introduces a new value in the domain of $x_i$,
denoted $v_M$, which corresponds to a new vertex added to $T$. To
determine how $v_M$ should be included in the tree, Algorithm
3 identifies the set $U$ of all vertices $u$ for which $L(v_M, u)$
holds. We ignore the arcs $L(u, v_M)$ because connecting any
non-leaf node to $v_M$ creates a cycle in the tree. Moreover,
connecting a leaf node $l$ to $v_M$ would result in two parents
for $l$. Thus, the only option for connecting $v_M$ to the tree is
to include an outgoing arc from $v_M$ to some other tree node.
If the set $U$ is empty (Step 7), then $v_M$ is directly connected
to the root $\gamma$; i.e., an arc $(v_M, \gamma)$ is included in $T$. In this
case, we consider $(v_M, \gamma)$ as the base vector of a linear set
and $(1, 0)$ as the period vector. Such a linear set captures the
unbounded growth of the domain size as new arcs connected to
the root. That is, the root $\gamma$ would have an unbounded number
of children. If $U$ is non-empty (Step 8), then a value $w \in U$ is
randomly selected to be the parent of $v_M$ in the tree; i.e., the
arc $(v_M, w)$ is included in the tree. Every time the domain size
increases, the value of $v_M$ is incremented. For this reason, the
first element of the period vector must be 1. For simplicity, we
consider the growth of $w$ in an incremental fashion too. That
is, the period vector is $(1, 1)$ and the base vector is $(v_M, w)$.
Overall, Steps 7 and 8 determine the values of the base vector
$(b, b')$ and the period vector $(p, p')$ of the unbounded core.

Fig. 1: Overview of the proposed synthesis method.


Algorithm 3. Gen_LinearSets($L(x_{i-1}, x_i)$: state predicate,
$B$: positive integer)

1: $M := 2$.

2: If $M > B$ then declare that $\gamma$ could not be found and
exit; // Upper bound reached.

3: If there is a solution for some value $\gamma$ where $L(\gamma, \gamma)$
holds modulo $M$ and $M + 1$, then go to Step 4;
otherwise, $M := M + 1$ and go to Step 2.

4: $T :=$ ConstructSpanningTree($L(x_{i-1}, x_i)$, $M$, $\gamma$).

5: $C := S_T$, where $S_T$ represents the set of arcs of $T$ as a
set of integer vectors. // The common core detected

6: $M' := M + 1$ and let $v_M$ denote the new vertex (i.e.,
value $M$ modulo $M'$) due to domain size increase.
Calculate the set $U = \{u \mid L(v_M, u) \text{ holds}\}$;

7: If $U = \emptyset$ then include arc $(v_M, \gamma)$ every time the
domain is increased. Set the base vector to $(v_M, \gamma)$, and
the period vector to $(1, 0)$. Thus, $(b, b') := (v_M, \gamma)$, and
$(p, p') := (1, 0)$. // Unbounded core UC.

8: Else select an arc $(v_M, w)$ for some value $w \in U$ as
the base vector. Set the base vector to $(v_M, w)$, and the
period vector to $(1, 1)$. Thus, $(b, b') := (v_M, w)$, and
$(p, p') := (1, 1)$. // Unbounded core UC.

9: For each integer vector $(c, d) \in C$, return formulas
$\phi(x_{i-1}) \equiv (x_{i-1} = c)$, $\psi(x_{i-1}, x_i') \equiv (x_i' = d)$, and
$\psi_{x_i'}(x_{i-1}) \equiv d$.

10: Corresponding to the unbounded core $UC$ constructed
in Steps 7 and 8, return formulas $\phi(x_{i-1}) \equiv (x_{i-1} = b + \lambda p)$,
$\psi(x_{i-1}, x_i') \equiv (x_i' = x_{i-1} + (b' - b) + \lambda(p' - p))$,
and $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} + (b' - b) + \lambda(p' - p))$, where
$\lambda \in \mathbb{N}$.

end


Steps 9 and 10 specify the linear sets corresponding to
the common core and the unbounded core as Presburger
formulas [20]. Each integer vector $(a, b)$ in a linear set
actually represents an atomic action of the protocol specified
as $x_{i-1} = a \wedge C(x_{i-1}, x_i) \rightarrow x_i := b$, where $C(x_{i-1}, x_i)$ is a
Boolean expression specified in terms of $x_i$ and $x_{i-1}$. Since
the second element of each vector $(a, b)$ represents the updated
value of $x_i$, we use the notation $x_i'$ instead of $x_i$ when formally
specifying the linear sets of a semilinear set. For example,
we specify the linear set $\{(0, 1)\}$ as $x_{i-1} = 0 \wedge x_i' = 1$.
Each such formula provides an incomplete sketch of an action,
which should be completed in subsequent steps of synthesis.
In general, we specify a linear set $\mathcal{L}$ with the base vector
$(b, b')$ and the period vector $(p, p')$ as $\{(x_{i-1}, x_i') \mid \forall \lambda \in \mathbb{N} ::
(x_{i-1} = b + \lambda p) \wedge (x_i' = b' + \lambda p')\}$. Since $x_{i-1}$ and $x_i'$
are free variables and $\lambda$ is known to be a natural value, we
eliminate the quantifications in Steps 9 and 10 of Algorithm
3. Let $F_1 \equiv (x_{i-1} = b + \lambda p)$ and $F_2 \equiv (x_i' = b' + \lambda p')$.
Subtracting $F_1$ from $F_2$ relates $x_i'$ with $x_{i-1}$ as $\psi(x_{i-1}, x_i') \equiv
(x_i' = x_{i-1} + (b' - b) + \lambda(p' - p))$ (Step 10). Factoring out $x_i'$,
we get $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} + (b' - b) + \lambda(p' - p))$. In fact,
$\psi_{x_i'}(x_{i-1})$ represents the expression that should be assigned
to $x_i$ in the action corresponding to the linear set $\mathcal{L}$.
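
As a small worked instance of the Step 10 formulas, the Python sketch below evaluates the assigned expression for a concrete predecessor value. It assumes non-negative periods (Algorithm 3 only produces 0 or 1 as the first period component) and recovers $\lambda$ from $F_1$; the function names `solve_lambda` and `psi` are hypothetical helpers, not part of the paper.

```python
def solve_lambda(x_prev, base, period):
    """Return the natural number lam with x_prev = b + lam * p, or None if phi fails.

    Assumption: p >= 0, as is the case for the period vectors built in Steps 7 and 8.
    """
    b, _ = base
    p, _ = period
    if p == 0:
        return 0 if x_prev == b else None
    quotient, remainder = divmod(x_prev - b, p)
    return quotient if remainder == 0 and quotient >= 0 else None


def psi(x_prev, base, period):
    """Value assigned to x_i: x_prev + (b' - b) + lam * (p' - p), or None if phi fails."""
    lam = solve_lambda(x_prev, base, period)
    if lam is None:
        return None
    b, b_prime = base
    p, p_prime = period
    return x_prev + (b_prime - b) + lam * (p_prime - p)
```

For the base vector (2,1) and period vector (1,1), `psi(5, (2, 1), (1, 1))` evaluates to 4, that is, the predecessor value minus one; for a period (1,0) the $\lambda$ term no longer cancels, which is why $\lambda$ is solved from $F_1$ first.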
The NA protocol. Figures 2a and 2b respectively represent the 
locality graph and the spanning tree of NA for M = 2. Figures 
3 to 5 illustrate the locality graphs and the spanning trees for 
domain sizes 3 to 5. The semilinear set of the NA protocol 
can be specified as the union of the following linear sets: 

• Linear set 1: The base vector is $(1, 1)$, and the period
vector is $(0, 0)$. That is, for the unbounded domain $M$,
this set would be equal to $\{(x_{i-1}, x_i') \mid x_{i-1} = (1 + \lambda 0) = 1$
and $x_i' = (1 + \lambda 0) = 1$ where $\lambda \in \mathbb{N}\}$. Since
the period vector is $(0, 0)$, this set includes just a single
vector; i.e., $\{(1, 1)\}$. Thus, we have $\phi(x_{i-1}) \equiv (x_{i-1} = 1)$,
$\psi(x_{i-1}, x_i') \equiv (x_i' = 1)$ and $\psi_{x_i'}(x_{i-1}) \equiv 1$ for this
linear set.

• Linear set 2: The base vector is $(0, 1)$, and the period
vector is $(0, 0)$. Thus, we have $\phi(x_{i-1}) \equiv (x_{i-1} = 0)$,
$\psi(x_{i-1}, x_i') \equiv (x_i' = 1)$ and $\psi_{x_i'}(x_{i-1}) \equiv 1$.

• Linear set 3: Using the base vector $(2, 1)$, and the period
vector $(1, 1)$, this linear set is specified as $\{(x_{i-1}, x_i') \mid
x_{i-1} = 2 + \lambda$ and $x_i' = 1 + \lambda$ where $\lambda \in \mathbb{N}\}$. Step 10 gives
$\phi(x_{i-1}) \equiv (x_{i-1} = 2 + \lambda)$, which means $\phi(x_{i-1}) \equiv
(x_{i-1} \geq 2)$. Moreover, we have $\psi(x_{i-1}, x_i') \equiv (x_i' =
x_{i-1} - 1)$, and $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} - 1)$.

The union of the above linear sets forms a semilinear set
that captures the unbounded spanning tree of the NA protocol.

Fig. 2: Locality graph and a spanning tree of NA for M = 2.
(a) Locality graph representing predicate $L(x_{i-1}, x_i)$ in NA.
(b) A spanning tree rooted at 1.

Fig. 3: Locality graph and a spanning tree of NA for M = 3.
(a) Locality graph representing predicate $L(x_{i-1}, x_i)$ in NA.
(b) A spanning tree rooted at 1.

Fig. 4: Locality graph and a spanning tree of NA for M = 4.
(a) Locality graph representing predicate $L(x_{i-1}, x_i)$ in NA.
(b) A spanning tree rooted at 1.


Theorem III.2. Algorithm 3 terminates and is sound. That
is, it correctly generates a semilinear set representing an
unbounded spanning tree rooted at $\gamma$.


Proof. Due to space constraints, we provide a proof sketch here
and refer interested readers to [21] for the complete proof. The
proof of termination follows from the finiteness of the upper
bound $B$. The proof of soundness includes two parts. First, we
show that the common core $C$ constructed in Step 5 is a finite
union of some linear sets. Second, we prove that the union
of $C$ and the unbounded core generated in Steps 7 and 8 is a
semilinear set representing an unbounded spanning tree rooted
at $\gamma$. We show this by induction on $M$.


D. Synthesizing Parameterized Actions from Linear Sets 


This section presents a method for the synthesis of parameterized
actions of self-stabilizing protocols from linear sets.
Each linear set in the semilinear set represents the structure
of an individual action in a protocol with deterministic and
self-disabling processes. However, such a structure lacks details
of the guard and statement of each action. Thus, the question
is: how do we synthesize the guard and the statement of each
action? The guard of each action includes three components:
(1) its structure (taken from a linear set); (2) $\neg L(x_{i-1}, x_i)$;
and (3) the self-disabling condition, which is the negation of
the statement of the action. Since a linear set contains integer
vectors $(a, b)$ where $a$ represents the value that $x_{i-1}$ should
have before the value of $x_i$ is updated to $b$, the first component
of a guard includes all values of $x_{i-1}$ that make the formula
$\phi(x_{i-1})$ true, and the statement of the action should make
$\psi(x_{i-1}, x_i')$ true. Moreover, an action is enabled for all values
of $x_i$ (in the current state of a process) that make $L(x_{i-1}, x_i)$
false, which is why $\neg L(x_{i-1}, x_i)$ is a part of the guard condition.
The statement of the action should make $L(x_{i-1}, x_i)$ true.
Moreover, once an action is executed, it should disable itself;
i.e., the self-disabling assumption. This means that the guard
of an action should contain the negation of the expression
that holds after the execution of the action. Thus, the third
component of a guard is $\neg\psi(x_{i-1}, x_i)$. In the computation
of $\psi(x_{i-1}, x_i)$, Algorithm 4 uses the values of $x_{i-1}$ and
$x_i$ in the current state of process $P_i$, before $x_i$ is updated.
In summary, the guard of each action would be equal to
$\phi(x_{i-1}) \wedge \neg L(x_{i-1}, x_i) \wedge \neg(x_i = \psi_{x_i'}(x_{i-1}))$ (see Algorithm
4). Since $x_i'$ represents the updated value of $x_i$ in $\psi(x_{i-1}, x_i')$,
one can refactor $\psi(x_{i-1}, x_i')$ in order to generate $\psi_{x_i'}(x_{i-1})$,
which denotes $\psi(x_{i-1}, x_i')$ modulo $x_i'$. That is, $\psi_{x_i'}(x_{i-1})$
treats $x_i'$ as a function of $x_{i-1}$. This way, we create the
assignment $x_i := \psi_{x_i'}(x_{i-1})$ in Line 2 of Algorithm 4.

Fig. 5: Locality graph and a spanning tree of NA for M = 5.
(a) Locality graph representing predicate $L(x_{i-1}, x_i)$ in NA.
(b) A spanning tree rooted at 1.

Algorithm 4. Gen_Actions($\phi(x_{i-1})$, $\psi(x_{i-1}, x_i')$: Presburger
formula corresponding to a linear set, $L(x_{i-1}, x_i)$: state
predicate)

1: $G := \phi(x_{i-1}) \wedge \neg L(x_{i-1}, x_i) \wedge (x_i \neq \psi_{x_i'}(x_{i-1}))$

2: $A := (x_i := \psi_{x_i'}(x_{i-1}))$

3: Return $G \rightarrow A$
end
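
The three lines of Algorithm 4 can be mirrored by a pair of closures. The Python sketch below is a hypothetical rendering, not the authors' implementation: `gen_action`, `L_na`, and the fixed domain size `M = 4` are assumptions made for illustration, and reading $x_i + 1$ modulo $M$ in the NA predicate is an interpretation rather than something the text states explicitly.

```python
from typing import Callable, Tuple

Guard = Callable[[int, int], bool]   # (x_prev, x_cur) -> is the action enabled?
Assign = Callable[[int], int]        # x_prev -> new value of x_i


def gen_action(phi: Callable[[int], bool],
               psi_expr: Callable[[int], int],
               L: Callable[[int, int], bool]) -> Tuple[Guard, Assign]:
    """Build G = phi AND NOT L AND (x_i != psi_expr(x_prev)) and A = (x_i := psi_expr(x_prev))."""
    def guard(x_prev: int, x_cur: int) -> bool:
        return phi(x_prev) and not L(x_prev, x_cur) and x_cur != psi_expr(x_prev)

    def assign(x_prev: int) -> int:
        return psi_expr(x_prev)

    return guard, assign


M = 4  # assumed domain size, for illustration only


def L_na(x_prev: int, x_cur: int) -> bool:
    """NA locality predicate; the modular reading of x_cur + 1 is an assumption."""
    return x_prev == x_cur or x_prev == (x_cur + 1) % M


# Action for the unbounded linear set with base (2,1) and period (1,1).
guard3, assign3 = gen_action(lambda p: p >= 2, lambda p: p - 1, L_na)
```

With these definitions, `guard3(3, 0)` holds and `assign3(3)` yields 2, after which `guard3(3, 2)` no longer holds, which illustrates the self-disabling requirement.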


1) Example: Synthesis of the Actions of the NA Protocol:
We first demonstrate how we generate the action corresponding
to the linear set $\{(1, 1)\}$. We take the output of Algorithm 3
for this linear set (i.e., $\phi(x_{i-1}) \equiv (x_{i-1} = 1)$, $\psi(x_{i-1}, x_i') \equiv
(x_i' = 1)$ and $\psi_{x_i'}(x_{i-1}) \equiv 1$) and generate its action.

• $\neg L(x_{i-1}, x_i)$: Since $L(x_{i-1}, x_i) \equiv (x_{i-1} = x_i) \vee
(x_{i-1} = x_i + 1)$, we include the constraint $(x_{i-1} \neq
x_i) \wedge (x_{i-1} \neq x_i + 1)$ in the guard of this action.

• Linear set constraint: This linear set imposes the constraint
$\phi(x_{i-1}) \equiv (x_{i-1} = 1)$ on the guard of the action.

• Self-disabling constraint: We use $\psi(x_{i-1}, x_i') \equiv (x_i' = 1)$
to specify this constraint. To this end, we first determine
the assignment of the action using $\psi_{x_i'}(x_{i-1}) \equiv 1$. Thus,
the assignment is just $x_i := 1$. As a result, the self-disabling
constraint is the negation of $x_i = 1$; i.e., $x_i \neq 1$.

Thus, the synthesized action is $(x_{i-1} = 1) \wedge (x_{i-1} \neq x_i) \wedge
(x_{i-1} \neq x_i + 1) \wedge (x_i \neq 1) \rightarrow x_i := 1$. Likewise, the action
generated from the linear set $\{(0, 1)\}$ is $(x_{i-1} = 0) \wedge (x_{i-1} \neq
x_i) \wedge (x_{i-1} \neq x_i + 1) \wedge (x_i \neq 1) \rightarrow x_i := 1$. We now generate
the action corresponding to the linear set $\{(x_{i-1}, x_i') \mid x_{i-1} =
2 + \lambda$ and $x_i' = 1 + \lambda$ where $\lambda \in \mathbb{N}\}$. Corresponding to
this unbounded linear set, Algorithm 3 generates $\phi(x_{i-1}) \equiv
(x_{i-1} \geq 2)$, $\psi(x_{i-1}, x_i') \equiv (x_i' = x_{i-1} - 1)$ and $\psi_{x_i'}(x_{i-1}) \equiv
(x_{i-1} - 1)$. We first synthesize the three components of the
guard of this action, and then generate its assignment.

• $\neg L(x_{i-1}, x_i)$: This part is again $(x_{i-1} \neq x_i) \wedge (x_{i-1} \neq
x_i + 1)$ for the same reason discussed for the first action.

• Linear set constraint: The constraint $\phi(x_{i-1})$ requires
that we include $(x_{i-1} \geq 2)$ as part of the guard condition.

• Self-disabling constraint: Using $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} - 1)$,
we realize that the assignment of this action establishes
the condition $(x_i = x_{i-1} - 1)$. Thus, we include the
constraint $(x_i \neq x_{i-1} - 1)$ in the guard, and $x_i := x_{i-1} - 1$
as the assignment of this action.


Putting everything together, we get the following action for
this unbounded linear set: $(x_{i-1} \geq 2) \wedge (x_{i-1} \neq x_i) \wedge (x_i \neq
x_{i-1} - 1) \rightarrow x_i := x_{i-1} - 1$.

Sample executions. Consider a computation of a ring of four
processes for a domain size $M = 4$ (i.e., $x_i \in \mathbb{Z}_4$) starting at
the state $s_0 = (0, 2, 1, 3)$, where processes $P_0$, $P_1$ and $P_3$ are
enabled based on the synthesized actions. For example, $P_0$ is
enabled because $x_0 = 0 \wedge x_3 = 3$ and the third action is
enabled. Using similar reasoning, one can figure out why
$P_1$ and $P_3$ are enabled at $s_0$. For brevity, we demonstrate a
synchronous execution of this ring, but one can extract an
asynchronous interleaving of processes that converges to the
same final state. Starting at $s_0$, all three enabled processes
can execute, where the entire ring transitions to the state
$s_1 = (2, 1, 1, 1)$, and then reaches the state $s_2 = (1, 1, 1, 1)$,
where everyone agrees with its predecessor. For a domain size
$M = 5$ and an arbitrary start state $(0, 2, 0, 3)$, the NA protocol
generates the following computation: $(2, 1, 1, 1)$, $(1, 1, 1, 1)$.
As another example, consider a larger ring of five processes
and $M = 5$. Starting at $(0, 4, 2, 3, 1)$, the NA protocol will converge
through the following states: $(0, 4, 3, 1, 2)$, $(1, 4, 3, 2, 1)$,
$(1, 1, 3, 2, 1)$, $(1, 1, 1, 2, 1)$, $(1, 1, 1, 1, 1)$. Yet another example
includes a case of $M = 7$ and six processes in the ring, with the
following converging computation: $(3, 5, 1, 1, 2, 5)$, $(4, 2, 4, 1, 1, 1)$,
$(1, 3, 1, 3, 1, 1)$, $(1, 1, 2, 1, 1, 1)$, $(1, 1, 1, 1, 1, 1)$. Observe that
the synthesized NA protocol is self-stabilizing for different
ring sizes and domain sizes.
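
The first three computations above can be replayed with a few lines of Python. This is a minimal sketch under two assumptions that the text does not spell out: the term $x_i + 1$ in $L$ is read modulo $M$, and all enabled processes fire simultaneously (a synchronous schedule). The names `L`, `next_value`, `step`, and `run` are hypothetical.

```python
def L(prev, cur, M):
    """NA locality predicate (x_{i-1} = x_i) or (x_{i-1} = x_i + 1), read modulo M (assumption)."""
    return prev == cur or prev == (cur + 1) % M


def next_value(prev, cur, M):
    """New value of x_i if one of the three synthesized NA actions is enabled, else None."""
    if L(prev, cur, M):
        return None                      # every guard contains NOT L
    target = 1 if prev in (0, 1) else prev - 1   # actions 1 and 2 assign 1; action 3 assigns prev - 1
    return target if cur != target else None     # self-disabling constraint


def step(state, M):
    """One synchronous step: every enabled process reads the old state and updates."""
    new_state = list(state)
    for i in range(len(state)):
        value = next_value(state[i - 1], state[i], M)   # state[-1] is the predecessor of P_0
        if value is not None:
            new_state[i] = value
    return tuple(new_state)


def run(state, M, max_steps=100):
    """Return the list of states visited after the start state, until no process is enabled."""
    trace = []
    for _ in range(max_steps):
        successor = step(state, M)
        if successor == state:
            break
        trace.append(successor)
        state = successor
    return trace
```

Calling `run((0, 2, 1, 3), 4)` returns `[(2, 1, 1, 1), (1, 1, 1, 1)]`, and `run((0, 4, 2, 3, 1), 5)` reproduces the five-state computation listed above. Other schedules (for example, asynchronous interleavings) may visit different intermediate states.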


IV. PARITY PROTOCOL 


This section demonstrates the synthesis of a Parity protocol,
where processes in the uni-ring should converge to
an agreed-upon parity starting from any arbitrary state. Formally,
the entire ring should self-stabilize to states where
$\forall i : i \in \mathbb{N} : (|x_{i-1} - x_i| \bmod 2) = 0$ holds. (Notice that
$|x_{i-1} - x_i| = \max(x_{i-1} - x_i, x_i - x_{i-1})$.) Figures 6 to 9
illustrate how the spanning tree of Parity grows as the domain
size increases. The common core is $\{(0, 0), (1, 0), (2, 0)\}$
because $M = 3$ is the first domain size for which there is a
solution. We synthesize an action corresponding to each linear
set.

• Linear set 1: The self-loop on 0 can be represented as
a linear set with the base vector $(0, 0)$ and the period
vector $(0, 0)$. Algorithm 3 outputs $\phi(x_{i-1}) \equiv (x_{i-1} = 0)$,
$\psi(x_{i-1}, x_i') \equiv (x_i' = 0)$, and $\psi_{x_i'}(x_{i-1}) \equiv 0$. Thus, the
assignment of the action is $x_i := 0$, and the requirement
of having self-disabling actions would be $x_i \neq 0$. The
constraint $\neg L(x_{i-1}, x_i)$ provides $(|x_{i-1} - x_i| \bmod 2) \neq 0$.
Thus, the synthesized action is $(x_{i-1} = 0) \wedge ((|x_{i-1} -
x_i| \bmod 2) \neq 0) \wedge (x_i \neq 0) \rightarrow x_i := 0$.

• Linear set 2: The base vector of this linear set is $(1, 0)$
and its period vector is $(0, 0)$. As a result, we have
$\phi(x_{i-1}) \equiv (x_{i-1} = 1)$, $\psi(x_{i-1}, x_i') \equiv (x_i' = 0)$, and
$\psi_{x_i'}(x_{i-1}) \equiv 0$. The assignment of the action is $x_i := 0$,
which leads to the self-disabling constraint $x_i \neq 0$. The
constraint $\neg L(x_{i-1}, x_i)$ provides $((|x_{i-1} - x_i| \bmod 2) \neq 0)$.
Thus, the synthesized action is $(x_{i-1} = 1) \wedge ((|x_{i-1} -
x_i| \bmod 2) \neq 0) \wedge (x_i \neq 0) \rightarrow x_i := 0$.

• Linear set 3: The base vector of this linear set is $(2, 0)$
and its period vector is $(0, 0)$. As a result, we have
$\phi(x_{i-1}) \equiv (x_{i-1} = 2)$, $\psi(x_{i-1}, x_i') \equiv (x_i' = 0)$, and
$\psi_{x_i'}(x_{i-1}) \equiv 0$. The assignment of the action is $x_i := 0$,
which leads to the self-disabling constraint $x_i \neq 0$. The
constraint $\neg L(x_{i-1}, x_i)$ provides $((|x_{i-1} - x_i| \bmod 2) \neq 0)$.
Thus, the synthesized action is $(x_{i-1} = 2) \wedge ((|x_{i-1} -
x_i| \bmod 2) \neq 0) \wedge (x_i \neq 0) \rightarrow x_i := 0$.

• Linear set 4: Using the base vector $(3, 1)$ and the period
vector $(1, 1)$, this linear set contains integer vectors $S_4 =
\{(x_{i-1}, x_i') \mid (x_{i-1} = 3 + \lambda) \wedge (x_i' = 1 + \lambda)$ where
$\lambda \in \mathbb{N}\}$. Algorithm 3 gives us $\phi(x_{i-1}) \equiv (x_{i-1} = 3 +
\lambda)$, which can be written as $\phi(x_{i-1}) \equiv (x_{i-1} \geq 3)$.
Algorithm 3 also outputs $\psi(x_{i-1}, x_i') \equiv (x_i' = x_{i-1} - 2)$,
and $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} - 2)$. The assignment of the action
is obtained from $\psi_{x_i'}(x_{i-1}) \equiv (x_{i-1} - 2)$, leading to
$x_i := x_{i-1} - 2$. Thus, the synthesized action for this linear
set is $(x_{i-1} \geq 3) \wedge ((|x_{i-1} - x_i| \bmod 2) \neq 0) \wedge (x_i \neq
x_{i-1} - 2) \rightarrow x_i := x_{i-1} - 2$.

Fig. 6: Locality graph and a spanning tree of the Parity
protocol for domain size 2.
(a) Locality graph representing predicate $|x_{i-1} - x_i| \bmod 2 = 0$ in the Parity protocol.
(b) A spanning tree rooted at 0.

Fig. 7: Locality graph and a spanning tree of the Parity
protocol for domain size 3.
(a) Locality graph representing predicate $|x_{i-1} - x_i| \bmod 2 = 0$ in the Parity protocol.
(b) A spanning tree rooted at 0.

Fig. 8: Locality graph and a spanning tree of the Parity
protocol for domain size 4.
(a) Locality graph representing predicate $|x_{i-1} - x_i| \bmod 2 = 0$ in the Parity protocol.
(b) A spanning tree rooted at 0.


V. RELATED WORK 


This section discusses the state of the art in the verification
and synthesis of parameterized systems, especially unbounded
and infinite-state systems. For example, predicate abstraction
[22], [23] enables a method for creating a finite-state
representation of infinite-state systems where safety properties can
be verified. Constraint language programming [24] enables
the verification of safety properties of concurrent systems
with unbounded data. Approaches for reachability analysis
of generalized Petri nets [25], [26] apply over-approximation
towards generating a finite model, and then develop an efficient
semi-decision procedure for forward reachability analysis.
Counter abstraction [27] utilizes integer counters to count the
number of processes in a specific state, but such abstractions
are too coarse for the design of self-stabilizing protocols
where recovery must be ensured from every concrete state.
Environment abstraction [28] extends counter abstraction in
order to model the abstract state and the environment of each
process. Invisible invariants [29], [30] infer an invariant of a
parameterized system by examining a few small instantiations
of protocols. Indexed predicates [31] provide a method for
the generation and verification of invariant predicates specified
in terms of the process indices in infinite-state systems. The
aforementioned methods mostly aim at the verification of
safety and local liveness properties, and it is unclear how they
can synthesize self-stabilizing unbounded protocols.

Fig. 9: A spanning tree of the Parity protocol for domain size 5.


Most methods for the synthesis of parameterized unbounded
systems provide few results for the synthesis of unbounded
self-stabilizing protocols, where a global liveness property
(i.e., convergence) must be met from any state in an unbounded
state space. For example, synthesis of Petri nets [32], [33], [34]
mainly focuses on the transformation of behavioral specifications
in the form of labeled transition systems to Petri nets.
UCLID5 [35], [36] provides a framework for modular verification
and synthesis of the artifacts (e.g., invariants, assume-guarantee
conditions) that are used during verification. Syntax-Guided
Synthesis (SyGuS) [37] generates the implementation
of a set of functions (each adhering to a grammar) in the
specification of a system for a background logic theory. It
is unclear how one can use SyGuS to synthesize the actions
of SS-SymU protocols which must interact asynchronously to
ensure convergence in a specific topology. Moreover, methods
that combine SyGuS with reactive synthesis are mostly applied
to centralized systems [38]. Oracle-Guided Inductive Synthesis
(OGIS) [39], [40], [41] is based on iterative query-response
interactions between a learner and a teacher towards synthesizing
a system that adheres to formal specifications. Utilizing OGIS
in the synthesis of self-stabilizing unbounded systems may not
converge to a solution that must recover from any state rather
than recovery from a proper set of initial states. While the
existing synthesis methods inspire our work, the novelty of
our approach mainly lies in the characterization of unbounded
actions as semilinear sets for the synthesis of SS-SymU.


VI. CONCLUSIONS AND FUTURE WORK 


This paper investigated the problem of synthesizing self- 
stabilizing symmetric protocols (SS-SymU) on uni-rings, 
where a ring can have an unbounded number of processes and 
processes have unbounded variables. While previous research 
[5] has addressed this problem for rings of unbounded size, 
we are not aware of any work that synthesizes self-stabilizing 
protocols having unbounded variables too. We first showed 
that the ability to represent unbounded actions of a protocol as 
semilinear sets is sufficient for synthesis. This reduces the syn- 
thesis of SS-SymU to the synthesis of semilinear sets. Then, 
we presented a sound algorithm that generates a semilinear 
set for a protocol from which the parameterized actions of 
the protocol are derived. We demonstrated how our algorithm 
can generate SS-SymU protocols (e.g., near agreement and 
parity on unbounded uni-rings) that were previously infeasible. 
We are currently implementing the proposed method as a 
synthesizer and are investigating the feasibility of synthesis 
for more complicated protocols and topologies. We would 
also like to know how semilinear sets can be utilized for the 
verification and synthesis of unbounded protocols that satisfy 
general temporal properties (instead of just self-stabilization).


Abstract: We present the RAPID framework for automatic software
verification by applying first-order reasoning in trace
logic. RAPID establishes partial correctness of programs with 
loops and arrays by inferring invariants necessary to prove 
program correctness using a saturation-based automated theorem 
prover. RAPID can heuristically generate trace lemmas, common 
program properties that guide inductive invariant reasoning. 
Alternatively, RAPID can exploit nascent support for induction 
in modern provers to fully automate inductive reasoning without 
the use of trace lemmas. In addition, RAPID can be used as an 
invariant generation engine, supplying other verification tools 
with quantified loop invariants necessary for proving partial 
program correctness. 


I. INTRODUCTION 


State-of-the-art deductive verification tools for programs con- 
taining inductive data structures ([1], [2], [3], [4], [5]) largely 
depend on satisfiability modulo theories (SMT) solvers to dis- 
charge verification conditions and establish software correct- 
ness. These approaches are mostly limited to reasoning over 
universally-quantified properties in fragments of first-order 
theories: arrays, integers, etc. In contrast, RAPID supports 
reasoning with arbitrary quantifiers in full first-order logic with 
theories [6]. Program semantics and properties are directly 
encoded in trace logic by quantifying over timepoints of pro- 
gram execution. This allows simultaneous reasoning about sets 
of program states, unlike model-checking approaches [2][7]. 
The gain in expressiveness is beneficial for reasoning about 
programs with unbounded arrays [6] or to prove security 
properties [8], for example. 

This paper presents what RAPID can do, sketches its design
(Section III), and describes its main components and implementation
aspects (Sections IV-VII). Experimental evaluation
using the SV-COMP benchmark [9] shows RAPID's efficacy
in verification (Section VIII).

Given a program loop annotated with pre/post-conditions,
RAPID offers two modes for proving partial program correctness.
In the first, RAPID relies on so-called trace lemmas,
a priori identified inductive properties that are automatically
instantiated for a given program. In the second, RAPID
delegates inductive reasoning to the underlying first-order
theorem prover [10][11], without instantiating trace lemmas.
In either mode, the automated theorem prover used by RAPID
is VAMPIRE [12]. RAPID can also synthesize quantified invariants
from program semantics, complementing other
invariant-generation methods.

 1  func main() {
 2    const Int[] a;
 3    const Int alength;
 4    Int[] b, c;
 5    Int blength, clength, i = 0, 0, 0;
 6    while(i < alength) {
 7      if(a[i] >= 0) {
 8        b[blength] = a[i];
 9        blength = blength+1;
10      } else {
11        c[clength] = a[i];
12        clength = clength+1;
13      }
14      i = i+1;
15    }
16  }

Fig. 1: Program partitioning an array a into two arrays b, c
containing positive and negative elements of a respectively.


Related Work: Verifying programs with unbounded data struc- 
tures can use model checking for invariant synthesis. Tools like 
Spacer/Quic3 ([4], [2]), SEAHORN [1] or FREQHORN [7] are 
based on constrained Horn clauses (CHC) and use either fixed-
point calculation or sampling/enumerating invariants until a
given safety assertion is proved. These approaches use SMT 
solvers to check validity of invariants and are limited to 
quantifier-free or universally-quantified invariants. Recurrence 
solving and data-structure-specific tactics can be used to infer 
and prove quantified program properties [3]. DIFFY [13] and 
VAJRA [5] derive relational invariants of two mutations of a 
program such that inductive properties can be enforced over 
the entire program, without invariants for each individual loop. 


II. MOTIVATING EXAMPLE 


We motivate RAPID using the program in Figure 1, written in a 
standard while-like programming language W. Each program 
in W consists of a single top-level function main, with arbi- 
trary nestings of if-then-else and while statements. W includes 
optionally-mutable integer (array) variables, and standard side- 
effect-free expressions over Booleans and integers. 

Semantics and properties of W-programs are expressed in
trace logic $\mathcal{L}$, an instance of many-sorted first-order logic with
theories and equality [6]. A timepoint in trace logic is a term of
sort $\mathbb{L}$ that refers to a program location. For example, $l_5$ refers
to Line 5 in Figure 1. If a program location occurs in a loop,
a timepoint is represented by a function $l : \mathbb{N} \mapsto \mathbb{L}$, where the
argument is a natural number representing a loop iteration.
For example, $l_6(0)$ denotes the first iteration of the loop before
entering the loop body. A mutable scalar variable $v$ is modeled
as a function over time $v : \mathbb{L} \mapsto \mathbb{I}$. An array variable is
modeled as a function $v : \mathbb{L} \times \mathbb{I} \mapsto \mathbb{I}$, where array indices
are represented by integer arguments. For constant variables
we omit the timepoint argument. We use a constant $nl_i \in \mathbb{N}$ to
denote the last iteration of the loop starting at $l_i$. When a loop
is nested within other loops, the last iteration is a function over
timepoints of all enclosing loops; $l_{end}$ denotes the timepoint
after program execution. For Figure 1, $l_6(nl_6)$ denotes the
program location of the loop at its last iteration, when the loop
condition no longer holds. We assume that programs terminate,
and hence RAPID focuses on partial correctness.

Fig. 2: Overview of the RAPID verification framework.

Figure 1 creates two new arrays, b and c, containing positive
and negative elements from the input array a respectively. Note
that the arrays are unbounded, and we use the symbolic,
non-negative constant alength to bound the length of the input
array a. The constraint that alength be non-negative can be
expressed within a conjecture (see (1) below for example). A
safety property we want to check is that for any position in
b there exists a position in a such that both values are equal
within the respective array bounds (and similarly for c). This
equates to the following conjecture expressed in trace logic¹:

$$\forall pos_{\mathbb{I}}.\ \exists pos'_{\mathbb{I}}.\ 0 \leq pos < blength(l_{end}) \wedge alength \geq 0 \rightarrow
0 \leq pos' < alength \wedge b(l_{end}, pos) = a(pos') \qquad (1)$$

To the best of our knowledge, other verification approaches
cannot automatically validate (1) due to quantifier alternation,
but RAPID proves this property for Figure 1.
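
To make the shape of conjecture (1) concrete, the following Python sketch runs a direct transcription of the program in Figure 1 on a concrete input and checks the property by brute force. This is testing on one input, not a proof, and it is unrelated to how RAPID establishes (1); the function names `partition` and `conjecture_1_holds` are hypothetical, and the unbounded arrays b and c are modeled as dictionaries.

```python
def partition(a):
    """Transcription of the W program in Figure 1; returns (b, blength, c, clength)."""
    alength = len(a)
    b, c = {}, {}                  # unbounded arrays modeled as index -> value maps
    blength = clength = i = 0
    while i < alength:
        if a[i] >= 0:
            b[blength] = a[i]
            blength = blength + 1
        else:
            c[clength] = a[i]
            clength = clength + 1
        i = i + 1
    return b, blength, c, clength


def conjecture_1_holds(a):
    """For every pos in [0, blength) there is a pos2 in [0, alength) with b[pos] == a[pos2]."""
    b, blength, _, _ = partition(a)
    return all(
        any(b[pos] == a[pos2] for pos2 in range(len(a)))
        for pos in range(blength)
    )
```

The nested `all`/`any` mirrors the quantifier alternation (a universal over positions in b followed by an existential over positions in a) that the text identifies as the difficult part of the conjecture.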


III. THE RAPID FRAMEWORK 


The RAPID framework consists of approximately 10,000 lines
of C++². Figure 2 summarizes the RAPID workflow. Inputs to
RAPID are programs $P$ written in W along with properties
$F$ expressed in $\mathcal{L}$. Preprocessing in RAPID applies program
transformations for common loop-altering programming
constructs, as well as timepoint inlining to obtain a simplified
program $P'$ from $P$ (see Section IV).

¹ We write $\forall x_S. F$ or $\exists x_S. F$ to mean that $x$ has sort $S$ in $F$.

² Available at https://github.com/vprover/rapid

    while(i < alength) {
      if (a[i] == x) {
        break;
      }
      i = i + 1;
    }

    Bool break = false;
    while(i < alength && !break) {
      if (a[i] == x) {
        break = true;
      }
      if (!break) {
        i = i + 1;
      }
    }

Fig. 3: Loop transformation for break-statement.

Next, RAPID performs inductive verification (see Section V) 
by generating the axiomatic semantics $[\![P']\!]$ expressed in $\mathcal{L}$ 
and instantiating a set $L_1, \ldots, L_n$ of inductive properties, 
so-called trace lemmas, for the respective program variables 
of P'. For establishing some property F, RAPID supports 
two modes of inductive verification: standard and lemmaless 
mode. The two modes differ in the underlying 
support for automating inductive reasoning while proving F. 
The standard verification mode equips the verification task 
with the trace lemmas $L_1, \ldots, L_n$, providing the necessary 
induction schemes for proving F. The lemmaless verification 
mode uses built-in inductive reasoning and relies less, or 
not at all, on trace lemmas. In either mode, the verification 
tasks of RAPID are encoded in the SMT-LIB format. Finally, 
a third and recent RAPID mode can be used for invariant 
generation (see Section VII). In this mode, RAPID "only" 
outputs quantified invariants using the SMT-LIB syntax; these 
invariants can further be used by other verification tools. 


IV. PREPROCESSING IN RAPID 


a) Program Transformations: We use standard program trans- 
formations to translate away break, continue and return 
statements. For these, RAPID introduces fresh Boolean pro- 
gram variables indicating whether a statement has been ex- 
ecuted. The program is adjusted accordingly: return state- 
ments end program execution; break statements invalidate 
the first enclosing loop condition; and for continue the 
remaining code of the first enclosing loop body is not executed. 
Example 1: Figure 3 shows a standard transformation for a 
break-statement. 
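
The effect of the break transformation can be checked on concrete inputs. The following Python sketch runs both loop variants of Figure 3 and returns the final value of i. It is an illustration only: the initial value i = 0 and the use of `len(a)` for alength are assumptions, and the flag is named `broke` because `break` is a Python keyword.

```python
def find_with_break(a, x):
    """Original loop: leave the loop with `break` once a[i] == x."""
    i = 0                               # assumption: i starts at 0
    while i < len(a):
        if a[i] == x:
            break
        i = i + 1
    return i

def find_with_flag(a, x):
    """Transformed loop: a fresh Boolean flag replaces the break statement."""
    i = 0
    broke = False                       # fresh Boolean program variable
    while i < len(a) and not broke:     # the flag invalidates the loop condition
        if a[i] == x:
            broke = True
        if not broke:                   # remaining loop body is guarded by the flag
            i = i + 1
    return i
```

Both functions return the index of the first occurrence of x, or `len(a)` if x does not occur, so the transformed loop preserves the final program state of the original.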

b) Timepoint Inlining: RAPID uses SSA-style inlining [14], 
[15], [16] for timepoints to simplify axiomatic program se- 
mantics and trace lemmas of a verification task. Specifically, 
RAPID caches (i) for each integer variable the current program 

     1  a = a + 2;
     2  b = 3;
     3  c = a + b;
        assert(a(l_end) < c(l_end))

(a) block assignments 

        if (x < 1) {
          x = 0;
        } else {
          skip;
        }
     6  while (y > 0) {
          y = y - 1;
        }
    10  assert(x(l_end) >= 0)

(b) simple branching 

Fig. 4 


expression assigned to it, and (ii) for each integer-array 
variable the last timepoint where it was assigned. Cached 
values are used during traversal of the program tree to simplify 
later program expressions. Thus we avoid defining irrelevant 
equalities of program variable values over unused timepoints, 
and only reference timepoints relevant to the property. We 
illustrate this on two examples: 

Example 2 (Inlining assigned integer expressions): The effect 
of inlined semantics can be observed when we encounter block 
assignments to integer variables: we can skip assignments and 
use the last assigned expression directly in any reference to 
the original program variable. Consider the partial program in 
Figure 4a. Our axiomatic semantics in trace logic [6] would 
result in 


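$$\begin{aligned}
& a(l_2) = a(l_1) + 2 \;\wedge\; b(l_2) = b(l_1) \;\wedge \\
& c(l_2) = c(l_1) \;\wedge\; a(l_3) = a(l_2) \;\wedge \\
& b(l_3) = 3 \;\wedge\; c(l_3) = c(l_2) \;\wedge \\
& a(l_{end}) = a(l_3) \;\wedge\; b(l_{end}) = b(l_3) \;\wedge \\
& c(l_{end}) = a(l_3) + b(l_3)
\end{aligned}$$

whereas the inlined version of semantics is drastically shorter: 

$$a(l_{end}) = a(l_1) + 2 \;\wedge\; c(l_{end}) = (a(l_1) + 2) + 3.$$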


In contrast to the extended semantics that define all program 
variables for each timepoint, the inlined version only considers 
the values of referenced program variables at the timepoint of 
their last assignment. Thus, when c is defined, RAPID directly 
references the (symbolic) values assigned to a and b. While 
b is not defined at all, note that a is defined as $a(l_{end})$ is 
referenced in the conjecture. Furthermore, the inlined semantics 
only make use of two timepoints, $l_1$ and $l_{end}$, as the remaining 
timepoints are irrelevant to the conjecture. 
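
The caching idea behind Example 2 can be sketched in a few lines of Python. This is a simplified illustration for straight-line integer assignments only, not RAPID's implementation; the expression encoding (tuples of the form `("+", left, right)`), the function names and the textual rendering `v(l1)` for the initial value of v are all assumptions.

```python
def inline_assignments(assignments, referenced):
    """Return inlined definitions for the referenced variables.

    assignments: list of (target, expr) pairs in program order, where expr
    is an int, a variable name, or a tuple ("+", left, right).
    """
    cache = {}  # variable -> last assigned expression over initial values

    def substitute(expr):
        if isinstance(expr, int):
            return expr
        if isinstance(expr, str):
            # A variable that was never assigned keeps its initial value.
            return cache.get(expr, expr + "(l1)")
        op, left, right = expr
        return (op, substitute(left), substitute(right))

    for target, expr in assignments:
        cache[target] = substitute(expr)   # no timepoint is introduced here
    return {v: cache.get(v, v + "(l1)") for v in referenced}

def show(expr):
    """Render an expression tree as a fully parenthesised string."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return "(" + show(left) + " " + op + " " + show(right) + ")"
    return str(expr)

# The program of Figure 4a; only a and c occur in the conjecture.
program = [("a", ("+", "a", 2)), ("b", 3), ("c", ("+", "a", "b"))]
inlined = inline_assignments(program, ["a", "c"])
```

Here `show(inlined["c"])` yields `((a(l1) + 2) + 3)`, which matches the right-hand side of the inlined semantics above, and b never receives a definition of its own.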

Example 3 (Inlining equalities with branching.): Figure 4b 
shows another program that benefits from inlining equalities, 


as well as only considering timepoints relevant to the conjecture. 
The original semantics defines program variables x 
and y for all program locations: $l_1$, $l_2$, $l_3$, $l_4$, $l_6(it)$, $l_6(nl_6)$, 
$l_{end}$, for some iteration it and final iteration $nl_6$. While the 
program contains two variables x and y, only x is used in the 
property we want to prove. Since no assignments to x contain 
references to y, the loop semantics do not interfere with x, so 
we have 

$$\begin{aligned}
& \big(x(l_3) < 1 \rightarrow x(l_6(0)) = 0\big) \;\wedge \\
& \big(x(l_3) \ge 1 \rightarrow x(l_6(0)) = x(l_3)\big) \;\wedge \\
& x(l_{end}) = x(l_6(0))
\end{aligned}$$

where the semantics of the loop defining y are omitted. Note 
that all timepoints of the if-then-else statements are flattened 
into the timepoint at the beginning of the loop at $l_6$ in iteration 
0. The axiomatic semantics thus reduce to three conjuncts 
defining the value of x throughout the execution. However, 
x is not defined in any loop iteration other than the first, as 
these iterations are irrelevant to the property. 

c) User-defined input: RAPID is fully automated. However, it 
may still benefit from manually-defined invariants to support 
the prover. Users can therefore extend the input to RAPID with 
first-order axioms written in the SMT-LIB format. 




V. INDUCTIVE VERIFICATION IN RAPID 


As mentioned above, RAPID implements two verification 
modes; in the default standard mode, RAPID uses trace lem- 
mas to prove inductive properties of programs. In its lemmaless 
mode RAPID relies on built-in induction support in saturation- 
based first-order theorem proving. In this section we elaborate 
on both modes further. 


A. Standard Verification Mode: Reasoning with Trace Lemmas 


RAPID’s standard mode relies on trace lemma reasoning to 
automate inductive reasoning. Trace lemmas are sound formulas 
that (i) are derived from bounded induction over loop 
iterations; (ii) represent common inductive program properties 
for a set of similar input programs; and (iii) are automatically 
instantiated for all relevant program variables of a specific 
input program during its translation to trace logic; see [6]. 
In all of our experiments from Section VIII, including the 
example from Figure 1, we only instantiate three generic 
inductive trace lemmas to establish partial correctness. One 
such trace lemma asserts, for example, that a program variable 
is not mutated after a certain execution timepoint. 

Example 4: Consider the safety assertion (1) of our running 
example from Figure 1. In its standard verification mode, 
RAPID proves correctness of (1) by using, among others, the 
following trace lemma instance 


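$$\begin{aligned}
& \forall j_{\mathbb{I}}.\ \forall {b_L}_{\mathbb{N}}.\ \forall {b_R}_{\mathbb{N}}.\ \Big( \\
& \quad \forall it_{\mathbb{N}}.\ \big( (b_L \le it < b_R \wedge b(l_6(b_L), j) = b(l_6(it), j)) \\
& \qquad\quad \rightarrow b(l_6(b_L), j) = b(l_6(s(it)), j) \big) \\
& \quad \rightarrow \big( b_L \le b_R \rightarrow b(l_6(b_L), j) = b(l_6(b_R), j) \big) \Big)
\end{aligned}$$

stating that the value of b at some position j is unchanged 
between two bounds $b_L$ and $b_R$ if, for any iteration it and its 
successor s(it), values of b are unchanged. 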
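
As a sanity check of why such a lemma is sound, the following Python sketch exhaustively tests a bounded-induction lemma of this kind on all short finite traces. It assumes the lemma has the shape "if equality with the value at the lower bound is preserved by every step between the bounds, then the values at the two bounds are equal"; the function names are hypothetical, and a trace is simply the list of values of b at one fixed position j, indexed by iteration.

```python
from itertools import product

def lemma_holds(trace, bl, br):
    """Check one instance of the lemma for a trace and bounds bl, br.

    trace[it] is the value of b at a fixed position j in iteration it.
    """
    # Premise: for all it with bl <= it < br, if the value at it equals the
    # value at bl, then so does the value at the successor iteration.
    premise = all(
        trace[bl] == trace[it + 1]
        for it in range(bl, br)
        if trace[bl] == trace[it]
    )
    # Conclusion: bl <= br implies equal values at both bounds.
    conclusion = (not bl <= br) or trace[bl] == trace[br]
    return (not premise) or conclusion

def check_all_traces(length=5, values=(0, 1)):
    """Exhaustively test the lemma on all traces of the given length."""
    for trace in product(values, repeat=length):
        for bl in range(length):
            for br in range(length):
                if not lemma_holds(trace, bl, br):
                    return False
    return True
```

The check succeeds for every trace because the premise is exactly the step case of an induction from the lower to the upper bound.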

Multitrace Generalization: RAPID can also be used to prove 
k-safety properties over k traces, useful for security-related 
hyperproperties such as non-interference and sensitivity [8]. 
For such problems it is sufficient to extend program variables 
to functions over time and trace, such that program variables 
are represented as $(\mathbb{L} \times \mathbb{T} \mapsto \mathbb{I})$. Program locations, and hence 
timepoints, are similarly parameterized by an argument of sort 
T to denote the same timepoint in different executions. 


B. Lemmaless Verification Mode 


When in lemmaless mode RAPID does not add any trace 
lemma to its verification task but relies on first-order theorem 
proving to derive inductive loop properties. An extended 
version of SMT-LIB (see Section VI) is used to provide the un- 
derlying prover with additional information to guide the search 
for necessary inductive schemes, such as likely symbols for 
induction. We further equip saturation-based theorem proving 
with two new inference rules that enable induction on such 
terms; see [17] for details. Multi-clause goal induction takes 
a formula derived from a safety assertion that contains a final 
loop counter, that is a symbol denoting last loop iterations, 
and inserts an instance of the induction schema for natural 
numbers with the negation of this formula as its conclusion 
into the proof search space. For example, consider the formula 
$x(l_5(nl_5)) < 0$. Multi-clause goal induction introduces the 
induction hypothesis $\big(x(l_5(0)) \ge 0 \wedge \forall it_{\mathbb{N}}.\ ((it < nl_5 \wedge x(l_5(it)) \ge 0) \rightarrow x(l_5(s(it))) \ge 0)\big) \rightarrow x(l_5(nl_5)) \ge 0$. If 
the base and step cases can be discharged, a contradiction can 
be easily produced from the conclusion and original clause. 
Array mapping induction also introduces an instance of the 
induction schema to the search space, but is not based on 
formulas derived from the goal. Instead, this rule uses clauses 
derived from program semantics to generate a suitable con- 
clusion for the induction hypothesis. 


VI. VERIFYING PARTIAL CORRECTNESS IN RAPID 


For proving the verification tasks of Section V, and thus veri- 
fying partial program correctness, RAPID relies on saturation- 
based first-order theorem proving. To this end, each verifica- 
tion mode of RAPID uses the VAMPIRE prover, for which we 
implemented the following, RAPID-specific adjustments. 
a) Extending SMT-LIB: Each verification task of RAPID is 
expressed in extensions of SMT-LIB, allowing us to treat some 
terms and definitions in a special way during proof search: 
(i) declare-nat: The VAMPIRE prover has been extended 
with an axiomatization of the natural numbers as a term 
algebra, especially for RAPID-style verification purposes. 
We use the command (declare-nat Nat zero s p Sub) 
to declare the sort Nat, with constructors zero and 
successor s, predecessor p and ordering relation Sub. 

(ii) declare-lemma-predicate: Our trace lemmas are 
usually of the form $(P_1 \wedge \ldots \wedge P_n) \rightarrow Conclusion_L$ for 
some trace lemma L with premises $P_1 \wedge \ldots \wedge P_n$. In terms 
of reasoning, it makes sense for the prover to derive the 
premises of such a lemma before using its conclusion 
to derive more facts, as we have many automatically 
instantiated lemmas of which we can only prove the 
premises of some from the semantics. To enforce this, we 
adapt literal selection such that inferences from premises 
are preferred over inferences from conclusions. Lemmas 
are split into two clauses $(P_1 \wedge \ldots \wedge P_n) \rightarrow Premise_L$ 
and $Premise_L \rightarrow Conclusion_L$, where $Premise_L$ 
is declared as a lemma literal. We ensure our literal 
selection function selects either a negative lemma literal³ 
if available, or a positive lemma literal only in combination 
with another literal, requiring the prover to resolve 
premises before using the conclusion. 

The lemmaless mode of RAPID introduces the following 
additional declarations to SMT-LIB: 

(i) declare-const-var: assign symbols representing constant 
program variables a large weight in the prover’s 
term ordering, allowing constant variables to be rewritten 
to non-constant expressions. 

(ii) declare-timepoint: distinguish a symbol representing 
a timepoint from program variables, guiding VAMPIRE to 
apply induction upon timepoints. 

(iii) declare-final-loop-count: declare a symbol as a 
final loop count symbol, eligible for induction. 

b) Portfolio Modes: We further developed a collection of 
RAPID-specific proof options in VAMPIRE, using for example 
extensions of theory split queues [18] and equality-based 
rewritings [19]. Such options have been distilled into a RAPID 
portfolio schedule that can be run with --mode portfolio 
-sched rapid. Moreover, the multi-clause goal induction 
rule and the array mapping induction inference of RAPID 
have been compiled to a separate portfolio mode, accessed 
via --mode portfolio -sched induction_rapid. 


VII. INVARIANT GENERATION WITH RAPID 


RAPID can also be used as an invariant generation engine, 
synthesizing first-order invariants using the VAMPIRE theorem 
prover. To do so, we use a special mode of VAMPIRE to 
derive logical consequences of the semantics produced by 
RAPID. Some of these consequences may be loop invariants. 
The symbol elimination approach of [20] declares a set of 
program symbols undesirable and reports only consequences 
from which these symbols have been eliminated. In 
RAPID, we adjust symbol elimination for deriving invariants 
in trace logic using VAMPIRE. These invariants may contain 
quantifier alternations, and some conjunction of them may well 
be enough to help other verification tools show some property. 
When RAPID is in invariant generation mode, the encoding 
of the problem is optimized for invariant generation. We limit 
trace lemmas to more specific versions of the bounded induc- 
tion scheme. We also remove RAPID-specific symbols such as 
lemma literals so that they do not appear in consequences. 


³ Note that lemma literals become negative in the premise definition after 
CNF-transformation. 

Symbol Elimination: Loop invariants should only contain 
symbols from the input loop language, with no timepoints. 
To remove such constructs, we apply symbol elimination: any 
symbol representing a variable v used on the left-hand side 
of an assignment is eliminated. However, we still want to 
generate invariants containing otherwise-eliminated variables 
at specific locations, so for each eliminated variable v we define 
v_init = $v(l_1)$ and v_final = $v(l_2)$ for appropriate 
locations $l_1$, $l_2$; these new symbols need not be eliminated. 
We further adjusted symbol elimination in RAPID to output 
fully-simplified consequences during proof search in VAMPIRE 
(the so-called active set [12]) at the end of a user-specified 
time limit. Consequences that contain undesirable symbols or 
are pure consequences of theories are removed at this stage. 
Reasoning with Integers vs. Naturals: In the standard setting, 
RAPID uses natural numbers (internally Nat) to describe loop 
iterations. However, in some situations it is advantageous to 
use the theory of integers: loop counter variable i of sort I will 
have the same numerical value as nl of sort N at the end of 
a loop. Integer-based timepoints allow deriving i(l(nl)) = nl. 
Such a clause can be very helpful for invariant generation, as 
shown in Example 5. 

Example 5: Consider the property $\forall x_{\mathbb{I}}.\ 0 \le x < alength \rightarrow a(x) = b(x)$. 
The property essentially requires us to prove 
that two arrays a, b are equal in all positions between 0 and 
alength. Such a property might for example be useful to 
prove when we copy from an array b into array a in a loop 
with loop condition i < alength where i is the loop counter 
variable incremented by one in each iteration. Now when we 
run RAPID in the invariant generation mode, we might be 
able to derive a property $\forall x.\ 0 \le x < nl \rightarrow a(x) = b(x)$, 
essentially stating that the property holds for all iterations of 
the loop. The prover can further easily deduce that $i(l(nl)) \ge alength$ 
thanks to our semantics. 

However, in case of natural numbers we cannot deduce that 
$i(l(nl)) = nl$ since the sorts of i and nl differ. In order to 
derive an invariant strong enough to prove the postcondition 
we depend upon the prover to find the invariant 
$\forall x.\ 0 \le x < i(l(nl)) \rightarrow a(x) = b(x)$ directly, which cannot be deduced by 
the prover as our loop semantics are bounded by loop iterations 
rather than the loop counter values. 

When using -integerIterations on we can circumvent 
this problem as the prover can then simply deduce the equality 
$i(l(nl)) = nl$ which makes the conjunction of clauses strong 
enough to prove the desired postcondition. 
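
The relationship between the loop counter and the iteration count in Example 5 can be observed on a concrete run. The Python sketch below simulates the copy loop described in the example; the initial value i = 0, the explicit iteration counter `nl` and the function name are assumptions made for illustration, and the code says nothing about how VAMPIRE derives the equality.

```python
def copy_loop(b, alength):
    """Copy b[0..alength) into a fresh array a.

    Returns the array a, the final value of the loop counter i,
    and the number nl of completed loop iterations.
    """
    a = [None] * alength
    i = 0                     # assumption: the counter starts at 0
    nl = 0                    # iteration count, playing the role of nl
    while i < alength:
        a[i] = b[i]
        i = i + 1
        nl = nl + 1
    return a, i, nl

a, i_final, nl = copy_loop([3, 1, 4], 3)
```

On every run the final counter value equals the iteration count, and at loop exit the counter is at least alength. Combining "equal on all positions below nl" with these two facts gives equality on all positions below alength, which is the postcondition.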


VIII. EXPERIMENTAL EVALUATION 


We evaluated the two verification modes of RAPID and compared 
them against the state-of-the-art solvers DIFFY and SEAHORN, 
as summarized below. 

Benchmark Selection: Our benchmarks⁴ are based on the 
c/ReachSafety-Array category of the SV-COMP repository [21], 
specifically from the array-examples/* subcategory⁵, 
as it contains problems suitable for our input language. 


⁴ https://github.com/vprover/rapid/tree/main/examples/arrays 
⁵ https://github.com/sosy-lab/sv-benchmarks/tree/master/c/array-examples 


TABLE I: Experimental Results 

Total | RAPID_std | RAPID_lemmaless | SEAHORN | DIFFY 
140   | 91 (5)    | 103 (10)        | 17 (0)  | 61 (1) 


Other examples are not yet expressible in W due to the 
presence of function calls and/or unsupported memory access 
constructs. We manually translate all programs to W and 
express pre/post-conditions as trace logic properties. Addition- 
ally, we extend some SV-COMP examples with new conjec- 
tures containing existential and alternating quantification. 

In general, SV-COMP benchmarks are bounded to a certain 
array size N. By contrast, we treat arrays as unbounded 
in RAPID and reason using symbolic array lengths. Some 
benchmarks in the original SV-COMP repository are minor 
variations of each other that differ only in one concrete integer 
value, e.g., to increment a program variable by some integer. 
Instead of copying each such variation for different digits, 
we abstract such constant values to a single symbolic integer 
constant such that just one of our benchmarks covers numerous 
cases in the original SV-COMP setup. 

Results: We compare our two RAPID verification modes, indicated 
by RAPID_std and RAPID_lemmaless respectively, against 
SEAHORN and DIFFY. All experiments were run on a cluster 
with two 2.5GHz 32-core CPUs with a 60-second timeout. 
Note that DIFFY produced the same results as its precursor 
VAJRA in this experiment. Table I summarizes our results, 
with parentheticals indicating uniquely solved problems. Of a total 
of 140 benchmarks, RAPID_std solves 91 problems, while 
RAPID_lemmaless surpasses this by 12 problems. In particular, 
RAPID_lemmaless could solve more variations of our running 
example (Figure 1) with quantifier alternations, as property-driven 
induction works well for such problems. A small number 
of instances, however, were solved by RAPID_std but not by 
RAPID_lemmaless within the time limit, indicating that trace 
lemma reasoning can help to fast-forward proof search. In 
total, RAPID solves 112 benchmarks, whereas SEAHORN and 
DIFFY could respectively prove 17 and 61 problems (with 
mostly universally quantified properties). For more detailed 
experimental data on subsets of these benchmarks we refer to 
[6], [17]. 


IX. CONCLUSION 


We described the RAPID verification framework for proving 
partial correctness of programs containing loops and arrays, 
and its applications towards efficient inductive reasoning and 
invariant generation. Extending RAPID with function calls, and 
automation thereof, is an interesting task for future work. 

Abstract—Networks are hard to configure correctly, and misconfigurations 
occur frequently, leading to outages or security 
breaches. Formal verification techniques have been applied to 
guarantee the correctness of network configurations, thereby 
improving network reliability. This work addresses verification 
of distributed network control planes, with two distinct contribu- 
tions to improve the scalability of verification. Our first contri- 
bution is a hierarchy of abstractions of varying precision which 
introduce nondeterminism into the procedure that routers use to 
select the best available route. We prove the soundness of these 
abstractions and show their benefits. Our second contribution is 
a novel SMT encoding which uses symbolic graphs to encode all 
possible stable routing trees that are compliant with the given 
network control plane configurations. We have implemented 
our abstractions and SMT encoding in a prototype tool called 
ACORN. Our evaluations show that our abstractions can provide 
significant relative speedups (up to 323x) in performance, and 
ACORN can scale up to ~ 37,000 routers in data center 
benchmarks (with FatTree topologies, running shortest-path 
routing and valley-free policies) for verifying reachability. This 
far exceeds the performance of existing control plane verifiers. 


I. INTRODUCTION 


Bugs in configuring networks can lead to expensive outages 
or critical security breaches, and misconfigurations occur fre- 
quently [1], [2], [3], [4], [5], [6]. Thus, there has been great 
interest in formal verification of computer network configu- 
rations. Many initial efforts targeted the network data plane, 
i.e., the forwarding rules in each router that determine how 
a given packet is forwarded to a destination. Many of these 
methods have been successfully applied in large data centers 
in practice [7], [8], [9]. In comparison, formal verification of 
the network control plane is more challenging. 

Traditional control planes use distributed protocols such 
as OSPF, BGP, and RIP [10] to compute a network data 
plane based on the route announcements received from peer 
networks, the current failures detected, and the router config- 
urations. In control plane verification, one must check that all 
data planes that emerge due to the router configurations are 
correct. There has been much recent progress in control plane 
verification. Fully symbolic SMT-based verifiers [11], [12], 
[13] usually work well for small-sized networks, but have not 
been shown to scale to medium-to-large networks. Simulation- 
based verifiers [14], [15], [16], [13], [17], [18] scale better, 
but in general, do not provide full symbolic reasoning, e.g., 
for considering all external route announcements. Our work 
is motivated by this gap: we aim to provide full symbolic 
reasoning and improve the scalability of verification. We 
address this challenge with two main contributions — a novel 
hierarchy of control plane abstractions, and a new symbolic 
graph-based SMT encoding for control plane verification. 


Hierarchy of nondeterministic abstractions. Our novel 
control plane abstractions introduce nondeterminism in the 
procedure that routers use to select a route — we call these the 
Nondeterministic Routing Choice (NRC) abstractions. Instead 
of forcing a router to pick the best available route, we allow 
it to nondeterministically choose a route from a subset of 
available routes which includes the best route. The number 
of non-best routes in this set determines the precision of 
the abstraction; our least precise abstraction corresponds to 
picking any available route that is compliant with policy. 

Our main insight here is that determining the best route may 
not be needed for verification of many correctness properties 
that network operators care about, such as reachability (e.g., 
when the number of hops may not matter), valley-freedom, or 
no-transit (Gao-Rexford conditions [19]). On the other hand, 
for policy-based routing, it is still important to model other 
protocol features such as route filters. Our results show, for 
the first time, that nondeterministic routing abstractions can 
successfully verify such properties and provide significant 
gains in performance and scalability. Although some other 
efforts [12], [20] have also proposed to abstract the decision 
process in BGP (details in § VID), we elucidate and study the 
general principle for generic distributed protocols, prove it 
sound, and reveal a range of precision-cost tradeoffs. 

The potential downside of considering non-best routes is 
that our abstractions may lead to false positives, i.e., we could 
report property violations although the best route may actually 
satisfy the property. In such cases, we propose using a more 
precise abstraction that models more of the route selection 
procedure. Our experiments (§VI) demonstrate that the NRC 
abstractions can successfully verify a wide range of networks 
and common policies and offer significant performance and 
scalability benefits in symbolic SMT-based verification. Al- 
though our abstractions are sound for verification of specified 
failures (§IV), we focus on verification without failures here, 
and plan to consider failures in future work. 


Symbolic graph-based SMT encoding. Our novel SMT 
encoding uses symbolic graphs [21] (where a Boolean variable 
is associated with each edge in the network topology) to model 
the stable states of a network control plane. Our encoding can 
leverage specialized SMT solvers such as MonoSAT [21] that 
provide support for graph-based reasoning, as well as standard 
SMT solvers such as Z3 [22]. 


Experimental evaluation. We have implemented our NRC 
abstractions and symbolic graph-based SMT encoding in a 
prototype tool called ACORN (Abstracting the COntrol plane 
using Route Nondeterminism). We present a detailed evalua- 
tion on benchmark examples that include synthetic data center 
examples with FatTree topologies [23], as well as real topolo- 
gies from Topology Zoo [24] and BGPStream [25] running 
well-known network policies, where we verify reachability 
and other properties of interest. All benchmark examples 
are successfully verified using an NRC abstraction (96% of 
examples with our least precise abstraction, and the remaining 
4% using a more precise abstraction). These benchmarks, 
including some new examples that we created, are publicly 
available [26]. ACORN could verify reachability in large Fat- 
Tree benchmarks with about 37,000 nodes (running common 
policies) within an hour. This kind of scalability is needed in 
modern data centers with tens of thousands of routers that run 
distributed routing protocols such as BGP [27]. We compared 
ACORN with two publicly available state-of-the-art control 
plane verifiers on the data center benchmarks, and our results 
show that our tool scales an order of magnitude better. 
To summarize, we make the following contributions: 


1) We present a hierarchy of novel control plane abstrac- 
tions, called the NRC abstractions, that add nondeter- 
minism to a general route selection procedure (§IV). 
We prove our abstractions sound and empirically show 
that they enable a precision-cost tradeoff in verification. 
Although our focus is on SMT-based verification, these 
abstractions could be used with other methods as well. 

2) We present a novel SMT encoding (§V) (based on 
symbolic graphs [21]) to capture distributed control plane 
behavior. This leverages SMT solvers that support graph- 
based reasoning, as well as standard SMT solvers. 

3) We implemented our abstractions and SMT encoding in 
a prototype tool called ACORN and present a detailed 
evaluation (§ VI) on synthetic data center benchmarks and 
real-world topologies with well-known network policies. 


II. MOTIVATING EXAMPLES 


In a distributed routing protocol, routers exchange route 
announcements containing information on how to reach vari- 
ous destinations. On receiving a route announcement, a router 
updates its internal state and sends a route announcement 
to neighboring routers after processing it as per the routing 
configurations. In well-behaved networks, this distributed de- 
cision process converges to a stable state [28] in which the 
internal routing information of each router does not change 
upon receiving additional route announcements. The best route 
selected by each router defines a routing tree: if router u selects 


[Figure 1: two example networks, (a) Example 1 and (b) Example 2. Edge labels show router actions, e.g. conditions on the tag c1 and "drop route".] 

Fig. 1: Examples showing correct verification result with an 
NRC abstraction. Red arrows show the routing tree in the real 
network, and green arrows show an additional routing tree 
allowed in the abstraction. 


the route announcement sent by router v for destination d, then 
u will forward data packets with destination d to v. 


Example 1 (Motivating example). Consider the network in 
Figure 1a (from ShapeShifter [16]) with five routers running 
the Border Gateway Protocol (BGP), described in Appendix A, 
where actions taken by routers are shown along the edges. 


The verification task is to check whether routes announced 
at R1 can reach R5. The network uses the BGP community 
attribute, a list of string tags, to ensure that R4 prefers to 
route through R3: the community tag c1 is added along the 
edge (R1, R3), which causes the local preference (lp) to be 
set to 200 along the edge (R3, R4). Routes with higher local 
preference are preferred (the default local preference is 100). 
Thus, the best route at R4 is through R3 and the corresponding 
routing tree is shown by red (solid) arrows. 

Note that R5 can receive a route even if R4 chooses to route 
through R2 instead, though this route is not the best for R4. 
Thus, R5 can reach the destination regardless of the choice R4 
makes. This observation captures the basic idea in our NRC 
abstractions. Intuitively, we explore multiple available routes 
at a node: the best route as well as other routes. Then we 
check if R5 receives a route under each of these possibilities. 
Since R5 can reach R1 in all routing trees considered by our 
abstraction, we correctly conclude that it can reach R1. 


False positives and refinement. The NRC abstractions are 
sound, i.e., when verification with an abstraction is successful, 
the property is guaranteed to hold in the network. However, 
verification with an abstraction could report a false positive, 
i.e., a property violation even when the network satisfies the 
property. In Figure 1a, suppose R5 drops routes without the 
tag c1. In the real network, R5 will receive a route, since the 
route sent by R4 has the tag c1. However, verification with an 
abstraction that considers all possible routes would report that 
R5 cannot reach the destination, with a counterexample where 
R4 routes through R2 and its route announcement is dropped 
by R5. Here, an NRC abstraction higher up in the precision 
hierarchy, e.g., one which chooses a route with maximum 
local preference and minimum path length, will verify that 
R5 receives a route, thereby eliminating the false positive. 
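
The two scenarios above can be replayed with a small explicit enumeration. The Python sketch below is not ACORN's SMT encoding; it simply enumerates the routing choices allowed by a selection policy. The topology is an assumption (the figure is not reproduced in the text): edges R1→R2, R1→R3, R2→R4, R3→R4 and R4→R5, with routes modelled as pairs of local preference and community tags. Path length is ignored, and all names are hypothetical.

```python
# Assumed topology for Figure 1a, listed as (sender, receiver) pairs.
EDGES = [("R1", "R2"), ("R1", "R3"), ("R2", "R4"), ("R3", "R4"), ("R4", "R5")]
ORDER = ["R1", "R2", "R3", "R4", "R5"]   # senders come before receivers

def transfer(edge, route, r5_requires_c1):
    """Return the route after crossing edge, or None if it is dropped."""
    lp, comms = route
    if edge == ("R1", "R3"):
        comms = comms | {"c1"}                  # tag c1 added along (R1, R3)
    if edge == ("R3", "R4") and "c1" in comms:
        lp = 200                                # local preference set to 200
    if edge == ("R4", "R5") and r5_requires_c1 and "c1" not in comms:
        return None                             # R5 drops routes without c1
    return (lp, comms)

def stable_states(choose, r5_requires_c1):
    """Enumerate all labelings permitted by the selection policy `choose`."""
    states = [{"R1": (100, frozenset())}]       # initial route at R1
    for node in ORDER[1:]:
        next_states = []
        for state in states:
            candidates = []
            for (u, v) in EDGES:
                if v == node and state[u] is not None:
                    route = transfer((u, v), state[u], r5_requires_c1)
                    if route is not None:
                        candidates.append(route)
            options = choose(candidates) if candidates else [None]
            for route in options:
                next_states.append({**state, node: route})
        states = next_states
    return states

def best_only(candidates):
    """Concrete selection: keep only a route with highest local preference."""
    return [max(candidates, key=lambda r: r[0])]

def any_route(candidates):
    """Least precise abstraction: any available route may be chosen."""
    return candidates

def verified(choose, r5_requires_c1=False):
    """R5 must receive a route in every labeling the policy allows."""
    return all(s["R5"] is not None for s in stable_states(choose, r5_requires_c1))
```

Without the filter at R5, both `best_only` and `any_route` verify reachability. With the filter, `any_route` reports a violation (the false positive described above), while a policy that respects local preference, here `best_only`, verifies the property.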


Path-sensitive reasoning. Even our least precise abstraction 
can verify many interesting policies due to our symbolic SMT- 
based approach which tracks correlations between choices 
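made at different routers, which other tools [16] do not track. 

SRP instance: $SRP = (G, A, a_d, \prec, trans)$, $G = (V, E, d)$ 
SRP solution: $\mathcal{L} : V \rightarrow A_\infty$ 

$$\mathcal{L}(u) = \begin{cases}
a_d & \text{if } u = d \\
\infty & \text{if } attrs_{\mathcal{L}}(u) = \emptyset \\
a \in attrs_{\mathcal{L}}(u), \text{ minimal by } \prec & \text{if } attrs_{\mathcal{L}}(u) \ne \emptyset
\end{cases}$$

$$attrs_{\mathcal{L}}(u) = \{a \mid (e, a) \in choices_{\mathcal{L}}(u)\}$$

$$choices_{\mathcal{L}}(u) = \{(e, a) \mid e = (v, u),\ a = trans(e, \mathcal{L}(v)),\ a \ne \infty\}$$

Fig. 2: Cheat sheet for SRP [30]. 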


Example 2 (Path-sensitivity). Figure 1b shows another BGP 
network (from Propane [29]), with seven routers and destination 
R1. We would like to verify that R7 can reach R1. 


In the real network, R4 chooses the route from R3 which has 
higher local preference (as shown by red/solid arrows). Under 
the least precise NRC abstraction, R4 could choose the route 
from R2 instead. Regardless of R4’s choice, the community 
tags in the routes received by R5 and R6 are the same, and so 
R7 will receive a route either way; our abstraction tracks this 
correlation and correctly concludes that R7 can reach R1. 


III. PRELIMINARIES 


In this section we briefly cover the background on the 
key building blocks required to describe our technical con- 
tributions. Our NRC abstractions are formalized using the 
Stable Routing Problem (SRP) model [30], [13], a formal 
model of network routing for distributed routing protocols. 
We also briefly describe SMT-based verification using the SRP 
model (e.g., Minesweeper [11]) and support for graph-based 
reasoning in the SMT solver MonoSAT [21]. 


Definition 1 (Stable Routing Problem (SRP) [30]). An SRP 
is a tuple $(G, A, a_d, \prec, trans)$ where $G = (V, E, d)$ is 
a graph representing the network topology with vertices V, 
directed edges E, and destination d; A is a set of attributes 
representing route announcements; $a_d \in A$ denotes the initial 
route sent by d; $\prec\ \subseteq A \times A$ is a partial order that models the 
route selection procedure (if $a_1 \prec a_2$ then $a_1$ is preferred); 
$trans : E \times A_\infty \rightarrow A_\infty$, where $A_\infty = A \cup \{\infty\}$ and $\infty$ denotes 
no route, is a transfer function that models the processing of 
route announcements sent from one router to another. 


Figure 2 summarizes the important notions for the SRP 
model [30]. The main difference from routing algebras [31], 
[32] is that the SRP model includes a network topology graph 
G to reason about a given network and its configurations. 


SRP solutions. A solution of an SRP is a labeling function
$\mathcal{L} : V \to A_\infty$ which represents the final route (attribute) chosen
by each node when the protocol converges. An SRP can have 
multiple solutions, or it may have none. Any SRP solution 
satisfies a local stability condition: each node selects the best 
among the route announcements received from its neighbors. 
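
The local stability condition can be illustrated with a small checker. The following is a minimal sketch, not the SRP papers' formulation: it assumes a hypothetical dictionary-based representation in which edges are (v, u) pairs, `None` plays the role of $\infty$, and a `prefers(a, b)` function stands in for $\prec$. The toy hop-count instance is likewise an assumption made for illustration.

```python
from typing import Callable, Dict, Hashable, List, Optional, Tuple

Node = Hashable
Edge = Tuple[Node, Node]          # (v, u): routes flow from v to u
Attr = Optional[Hashable]         # None plays the role of "no route" (infinity)


def is_stable(
    labeling: Dict[Node, Attr],
    edges: List[Edge],
    dest: Node,
    init: Hashable,
    trans: Callable[[Edge, Attr], Attr],
    prefers: Callable[[Hashable, Hashable], bool],
) -> bool:
    """Return True iff `labeling` satisfies the local stability condition."""
    for u, chosen in labeling.items():
        if u == dest:
            if chosen != init:
                return False
            continue
        # Attributes offered to u by its neighbors (dropped routes excluded).
        offers = [trans((v, w), labeling[v]) for (v, w) in edges if w == u]
        offers = [a for a in offers if a is not None]
        if not offers:
            if chosen is not None:
                return False
        elif chosen not in offers or any(prefers(a, chosen) for a in offers):
            return False
    return True


# Toy instance (hypothetical): attributes are hop counts, fewer hops preferred.
EDGES = [("d", "a"), ("d", "b"), ("a", "b"), ("b", "a")]


def hop_trans(edge: Edge, attr: Attr) -> Attr:
    return None if attr is None else attr + 1


def fewer_hops(a: int, b: int) -> bool:
    return a < b


GOOD = {"d": 0, "a": 1, "b": 1}
BAD = {"d": 0, "a": 1, "b": 2}    # b ignores the better direct route from d
```

Calling `is_stable(GOOD, EDGES, "d", 0, hop_trans, fewer_hops)` accepts the labeling, while `BAD` is rejected because node b holds a route that is not minimal among its offers.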


Example 3 (SRP example). The network in Figure 1a, running
a simplified version of BGP (simplified for pedagogic
reasons), is modeled using an SRP in which attributes are
tuples comprising an integer (local preference), a set of bit
vectors (community tags), and a list of vertices (the path).
We use $a.lp$, $a.comms$, and $a.path$ to refer to the elements
of an attribute $a$. The initial attribute at the destination is
$a_d = (100, \emptyset, [\,])$. The preference relation $\prec$ models the BGP
route selection procedure which is used to select the best route.
The attribute with highest local preference is preferred; to
break ties, the attribute with minimum path length is preferred
(more details are in Appendix A). The transfer function for
edge (R1, R3) adds the tag c1 and prepends R1 to the
path, returning $(100, a.comms \cup \{c1\}, [R1] + a.path)$. The
transfer function for edge (R3, R4) sets the local preference
to 200 if the tag c1 is present, i.e., if $c1 \in a.comms$ it
returns $(200, a.comms, [R3] + a.path)$; otherwise, it returns
$(100, a.comms, [R3] + a.path)$. The transfer function for other
edges $(u, v)$ prepends $u$ to the path, sets the local preference
to the default value (100), and propagates the community tags.
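
The attributes and transfer functions of Example 3 can be written out directly. This is an illustrative sketch only: the `Attr` tuple, the use of `None` for "no route", and the two-step `prefers` comparison (local preference, then path length) are assumptions; the full BGP procedure has further tie-breakers that are omitted here.

```python
from typing import FrozenSet, NamedTuple, Optional, Tuple


class Attr(NamedTuple):
    lp: int                      # local preference
    comms: FrozenSet[str]        # community tags
    path: Tuple[str, ...]        # routers traversed so far


INIT = Attr(100, frozenset(), ())   # initial attribute a_d at the destination


def trans(edge: Tuple[str, str], a: Optional[Attr]) -> Optional[Attr]:
    """Transfer functions of Example 3; None stands for 'no route'."""
    if a is None:
        return None
    sender, _receiver = edge
    path = (sender,) + a.path                  # prepend the sending router
    if edge == ("R1", "R3"):
        return Attr(100, a.comms | {"c1"}, path)
    if edge == ("R3", "R4"):
        return Attr(200 if "c1" in a.comms else 100, a.comms, path)
    return Attr(100, a.comms, path)            # default for all other edges


def prefers(a: Attr, b: Attr) -> bool:
    """a is strictly preferred to b: higher lp first, then shorter path."""
    return (-a.lp, len(a.path)) < (-b.lp, len(b.path))


at_r3 = trans(("R1", "R3"), INIT)
at_r4 = trans(("R3", "R4"), at_r3)
```

Here `at_r4` carries local preference 200 because the tag c1 was attached on edge (R1, R3), so it is preferred over any route of equal length that arrives with the default preference.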


SMT-based verification using SRP. Minesweeper [11] en- 
codes the SRP instance for the network using an SMT formula 
N, such that satisfying assignments of N correspond to SRP 
solutions. To verify if a property encoded as a formula $P$
holds, the satisfiability of $F = N \wedge \neg P$ is checked. If $F$
is satisfiable, a property violation is reported. Otherwise, the
property holds over the network (assuming $N$ is satisfiable;
otherwise there are no stable paths).


SMT with theory solver for graphs. MonoSAT [21] is an 
SMT solver with support for monotonic predicates. A predicate 
$p$ is (positive) monotonic in a variable $u$ if whenever $p(\ldots u = 0 \ldots)$
is true, $p(\ldots u = 1 \ldots)$ is also true. Graph reachability
is a monotonic predicate: if node $v_1$ can reach node $v_2$ in
a graph with an edge removed, it can still reach $v_2$ when
the edge is added. MonoSAT leverages predicate monotonicity
to provide efficient theory support for graph-based reasoning 
using a symbolic graph, a graph with a Boolean variable per 
edge. Formulas can include these Boolean edge variables as 
well as monotonic predicates such as reachability and max- 
flow. MonoSAT has been used to check reachability in data 
planes in AWS networks [33], [34], but not in control planes, 
as we do in this work. 
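
Monotonicity of reachability can be checked exhaustively on a small graph. The sketch below does not use MonoSAT; it is a plain-Python illustration in which the graph and the helper names `reaches` and `is_monotonic` are assumptions. It enumerates every subset of enabled edges and confirms that enabling one more edge never destroys reachability.

```python
from itertools import combinations
from typing import Iterable, Tuple

Edge = Tuple[str, str]


def reaches(edges: Iterable[Edge], src: str, dst: str) -> bool:
    """Depth-first search over the enabled (directed) edges."""
    enabled = list(edges)
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for (a, b) in enabled:
            if a == node and b not in seen:
                seen.add(b)
                stack.append(b)
    return False


def is_monotonic(all_edges: Tuple[Edge, ...], src: str, dst: str) -> bool:
    """Enabling one more edge must never turn reachability from true to false."""
    subsets = [frozenset(c) for r in range(len(all_edges) + 1)
               for c in combinations(all_edges, r)]
    for s in subsets:
        if reaches(s, src, dst):
            for e in all_edges:
                if not reaches(s | {e}, src, dst):
                    return False
    return True


EDGES = (("v1", "x"), ("x", "v2"), ("v1", "y"), ("y", "v2"), ("v2", "v1"))
```

A theory solver can exploit this property: once reachability holds under a partial assignment of edge variables, it holds under every extension that enables more edges.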


IV. NRC ABSTRACTIONS 


We formalize our NRC abstractions as abstract SRP in- 
stances, which are parameterized by a partial order. 


Definition 2 (Abstract SRP). For an SRP $S = (G, A, a_d, \prec, \mathit{trans})$,
an abstract SRP $\widehat{S}_{\prec'}$ is a tuple $(G, A, a_d, \prec', \mathit{trans})$,
where $G$, $A$, $a_d$, and $\mathit{trans}$ are defined as in the SRP $S$, and
$\prec' \subseteq A_\infty \times A_\infty$ is a partial order which satisfies


$$\forall B \subseteq A,\ \mathit{minimal}(B, \prec) \subseteq \mathit{minimal}(B, \prec') \tag{1}$$


where $\mathit{minimal}(B, \prec) = \{ a \in B \mid \nexists a' \in B.\ a' \neq a \wedge a' \prec a \}$
denotes the set of minimal elements of $B$ according to $\prec$.
Condition (1) specifies that for any set of attributes $B$, the
minimal elements of $B$ by $\prec$ are also minimal by $\prec'$.

[Hasse diagrams over the attributes $\infty$, (100, 15), (100, 10), (100, 5), and (200, 10). Left: BGP preference order $\prec$. Right: partial order $\prec^*$. Attributes drawn lower are better.]

Fig. 3: Partial orders in concrete and abstract SRPs.


Note that condition (1) ensures that the solutions (i.e., 
minimal elements) at any node in an SRP are also solutions at 
the same node in the abstract SRP, i.e., the NRC abstractions 
over-approximate the behavior of an SRP. The precision of 
an NRC abstraction depends on the partial order used. Our
least precise abstraction uses $\prec^*$, in which any two attributes
are incomparable and $\infty$ is worse than all attributes; it
corresponds to choosing any available route. The following
example illustrates solutions of an abstract SRP $\widehat{S}_{\prec^*}$.


Example 4 (Abstract SRP $\widehat{S}_{\prec^*}$). Figure 3 shows Hasse
diagrams for partially ordered sets comprising simplified BGP
attributes (pairs with local preference and path length; $\infty$
denotes no route) at a node $u$ and two partial orders: (1) $\prec$,
the partial order in the standard (concrete) SRP (lifted to $A_\infty$)
that models BGP's route selection procedure (shown on the
left), and (2) $\prec^*$, the partial order corresponding to choosing
any available route (shown on the right). Attributes appearing
lower in the Hasse diagram are considered better. Hence, in
the concrete SRP, $u$ will select (200, 10). In the abstract SRP,
any element that is minimal by $\prec^*$ can be a solution for $u$, so
$u$ nondeterministically selects an available route. Observe that
(200, 10), the solution for $u$ in the concrete SRP, is guaranteed
to be one of the solutions for $u$ in the abstract SRP. This
over-approximation due to condition (1) ensures that our abstraction
is sound, i.e., it will not miss any property violations.
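
Condition (1) can be checked by brute force on the attributes of Figure 3. The following sketch is illustrative: the function names are assumptions, attributes are (local preference, path length) pairs, and $\infty$ is left out because condition (1) quantifies only over subsets $B \subseteq A$.

```python
from itertools import combinations
from typing import Callable, Iterable, Set, Tuple

Attr = Tuple[int, int]            # (local preference, path length)


def bgp_prec(a: Attr, b: Attr) -> bool:
    """Concrete order: a is strictly better than b (max lp, then min length)."""
    return (-a[0], a[1]) < (-b[0], b[1])


def any_prec(a: Attr, b: Attr) -> bool:
    """Least precise order restricted to real routes: nothing is comparable."""
    return False


def minimal(B: Iterable[Attr], prec: Callable[[Attr, Attr], bool]) -> Set[Attr]:
    """Elements of B with no strictly better element in B."""
    items = list(B)
    return {a for a in items if not any(prec(b, a) for b in items if b != a)}


def satisfies_condition_1(universe, prec, prec_abs) -> bool:
    """Check minimal(B, prec) is a subset of minimal(B, prec_abs) for every B."""
    attrs = list(universe)
    for r in range(len(attrs) + 1):
        for B in combinations(attrs, r):
            if not minimal(B, prec) <= minimal(B, prec_abs):
                return False
    return True


ATTRS = [(100, 15), (100, 10), (100, 5), (200, 10)]   # attributes from Figure 3
```

With these definitions, `minimal(ATTRS, bgp_prec)` contains only (200, 10), while `minimal(ATTRS, any_prec)` contains every attribute, so the concrete choice is always among the abstract ones. Swapping the two orders violates the condition, which shows that the inclusion is directional.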


Verification with an NRC abstraction. To verify that a
property holds in a network using an abstraction $\prec'$, we
construct an SMT formula $\widehat{N}$ such that satisfying assignments
of $\widehat{N}$ are solutions of the abstract SRP $\widehat{S}_{\prec'}$ for the network, and
conjoin it with the negation of an encoding of the property $P$ to
get a formula $F = \widehat{N} \wedge \neg P$. If $F$ is unsatisfiable, all solutions
of $\widehat{S}_{\prec'}$ satisfy the property and verification is successful.
Otherwise, we report a violation with a counterexample (a
satisfying assignment), and a user can perform refinement
(described later in this section). Our approach is sound for
properties that hold for all stable states, i.e., properties of
the form $\forall \mathcal{L} \in \mathit{Sol}(S).\ P(\mathcal{L})$, where $\mathit{Sol}(S)$ denotes the
SRP solutions for the network. Like Minesweeper [11], our
approach only models the stable states of a network and
cannot verify properties over transient states that arise before
convergence.


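Lemma 1. [Over-approximation] For an SRP $S$ and corresponding
abstract SRP $\widehat{S}_{\prec'}$ with solutions $\mathit{Sol}(S)$ and
$\mathit{Sol}(\widehat{S}_{\prec'})$ respectively, $\mathit{Sol}(S) \subseteq \mathit{Sol}(\widehat{S}_{\prec'})$.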


Protocol | Partial order         | Best route
OSPF     | $\prec^*$             | Any
OSPF     | $\prec_{(pathcost)}$  | min path cost
OSPF     | $\prec_{ospf}$        | min path cost, min router ID
BGP      | $\prec^*$             | Any
BGP      | $\prec_{(lp)}$        | max lp (local preference)
BGP      | $\prec_{(lp,pl)}$     | max lp, min path length
BGP      | $\prec_{(lp,pl,MED)}$ | max lp, min path length, min MED (Multi-exit Discriminator)
BGP      | $\prec_{bgp}$         | max lp, min path length, min MED, min router ID


Fig. 4: Hierarchy of NRC abstractions for OSPF and BGP.


The proof follows from the definition of SRP solutions and the 
over-approximation condition (1) (full proof in Appendix B). 


Theorem 1. [Soundness] Given SMT formulas $\widehat{N}$ and $N$
modeling the abstract and concrete SRPs respectively and
SMT formula $P$ encoding the property to be verified, if
$\widehat{N} \wedge \neg P$ is unsatisfiable, then $N \wedge \neg P$ is also unsatisfiable.


The proof follows from Lemma 1 and is shown in Appendix B. 


Verification under failures. We model link failures using $\infty$,
which denotes no route (device failures are modeled as failures
of all incident links). Let $F$ denote a set of failed links. Given
SRP $S = (G, A, a_d, \prec, \mathit{trans})$, we model network behavior
under failures $F$ using an SRP $S_F = (G, A, a_d, \prec, \mathit{trans}_F)$
where $\mathit{trans}_F$ returns $\infty$ along edges in $F$ and is the same as
$\mathit{trans}$ for other edges. We similarly define an abstract SRP for
$S_F$, $\widehat{S}_{\prec',F} = (G, A, a_d, \prec', \mathit{trans}_F)$; it only differs from $S_F$
in the partial order $\prec'$. Since Lemma 1 holds for an arbitrary
concrete SRP $S$, it holds for $S_F$, i.e., any solution of $S_F$ is also
a solution of $\widehat{S}_{\prec',F}$. Hence, the NRC abstractions are sound
for verification under specified failures.
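
The construction of the failure-aware transfer function is a thin wrapper around the original one. The sketch below is a minimal illustration with hypothetical names (`with_failures`, `hop_count`); it uses `None` for $\infty$ and treats each failed link as a single directed edge, so a bidirectional link failure would need both directions listed.

```python
from typing import Callable, Hashable, Iterable, Optional, Tuple

Edge = Tuple[str, str]
Attr = Optional[Hashable]         # None stands for "no route"


def with_failures(
    trans: Callable[[Edge, Attr], Attr],
    failed: Iterable[Edge],
) -> Callable[[Edge, Attr], Attr]:
    """Build trans_F: drop every route on a failed link, else defer to trans."""
    failed_set = set(failed)

    def trans_f(edge: Edge, attr: Attr) -> Attr:
        if edge in failed_set:
            return None
        return trans(edge, attr)

    return trans_f


def hop_count(edge: Edge, attr: Attr) -> Attr:
    """Toy transfer function (hypothetical): count hops from the destination."""
    return None if attr is None else attr + 1


trans_f = with_failures(hop_count, [("d", "a")])
```

Because only the transfer function changes, the same partial order and the same over-approximation argument apply to the wrapped SRP.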


Hierarchy of NRC abstractions. The least precise NRC
abstraction (using $\prec^*$) does not model the route selection
procedure at all, and chooses any route. More precise abstractions
can be obtained by modeling the route selection procedure
partially. Figure 4 shows partial orders and corresponding
route selection procedures (shown as steps in a ranking
function) for OSPF and BGP, ordered from least precise ($\prec^*$) to
most precise ($\prec$). For example, $\prec_{(lp,pl)}$ corresponds to the
first two steps of BGP's route selection procedure, i.e., it
first finds routes with maximum local preference, and from
these, selects one with minimum path length. Appendix A has
more details of BGP's route selection procedure. Abstractions
higher up in the hierarchy are more precise as they model
more of the route selection procedure but are more expensive
as their SMT encodings have more variables and constraints.
This tradeoff between precision and performance is evident in
our experiments: verification with $\prec_{(lp)}$ was successful for all
networks for which verification with $\prec^*$ gave false positives
(§VI-B), but took up to 2.7x more time.
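
One way to picture the hierarchy is to treat each abstraction as a prefix of the ranking function. The sketch below is an assumption-laden illustration, not ACORN's encoding: routes are dictionaries with hypothetical field names, and `best_routes` returns the routes that remain candidates when only the listed steps are modeled.

```python
from typing import Dict, List, Sequence

Route = Dict[str, int]

# Steps of the BGP ranking function, in order; a smaller key component is better.
BGP_STEPS = {
    "lp": lambda r: -r["lp"],      # max local preference
    "pl": lambda r: r["pl"],       # min path length
    "med": lambda r: r["med"],     # min MED
    "rid": lambda r: r["rid"],     # min router ID
}


def best_routes(routes: Sequence[Route], steps: Sequence[str]) -> List[Route]:
    """Routes that are minimal when only the given ranking steps are modeled."""
    if not routes:
        return []

    def key(r: Route):
        return tuple(BGP_STEPS[s](r) for s in steps)

    best = min(key(r) for r in routes)
    return [r for r in routes if key(r) == best]


ROUTES = [
    {"lp": 100, "pl": 2, "med": 0, "rid": 1},
    {"lp": 200, "pl": 4, "med": 5, "rid": 2},
    {"lp": 200, "pl": 3, "med": 5, "rid": 3},
    {"lp": 200, "pl": 3, "med": 1, "rid": 4},
]

HIERARCHY = [[], ["lp"], ["lp", "pl"], ["lp", "pl", "med"],
             ["lp", "pl", "med", "rid"]]
sizes = [len(best_routes(ROUTES, steps)) for steps in HIERARCHY]
```

The candidate set shrinks as more steps are modeled, and the route chosen by the full procedure stays in every candidate set, which mirrors the over-approximation guarantee.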


Abstraction refinement. If verification with an abstraction
fails, we use an automated procedure to validate the returned
counterexample by checking if each node actually chose the
best route. The selected routes in the counterexample may 
contain only some fields, depending on the abstraction used. 
We find the values of the other fields and the set of available 
routes by applying the transfer functions along the edges in 
the counterexample, starting from the destination router (i.e., 
by effectively simulating the counterexample on the concrete 
SRP). We then check if all routers selected the best route 
that they received. If this is the case, we have found a real 
counterexample, i.e., a stable solution in the real network that 
violates the property; if not, the counterexample is spurious. 
We can eliminate the spurious counterexample by adding a 
blocking clause that is the negation of the variable assignment 
corresponding to it and repeat verification with the same 
abstraction in a CEGAR [35] loop, but this could take many 
iterations to terminate. Instead, we suggest choosing a more 
precise abstraction which is higher up in the NRC hierarchy. 
We could potentially use a local refinement procedure that uses 
a higher-precision abstraction only at certain routers, based on 
the counterexample. We plan to explore this and other ways of 
counterexample-guided abstraction refinement in future work. 
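
The validation step can be sketched as a replay of the counterexample on the concrete SRP. The code below is a simplified illustration, not ACORN's procedure: it assumes the counterexample is given as a map from each router to its chosen neighbor, recomputes every attribute from scratch, and uses a hypothetical three-router network in which edge (a, b) raises local preference.

```python
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[str, str]                 # (v, u): routes flow from v to u
Attr = Optional[Tuple[int, int]]       # (local preference, hops); None = no route


def concretize(choice: Dict[str, Optional[str]], dest: str, init: Attr,
               trans: Callable[[Edge, Attr], Attr]) -> Dict[str, Attr]:
    """Recompute full attributes by applying trans along the chosen edges."""
    label: Dict[str, Attr] = {dest: init}
    pending = [u for u in choice if u != dest]
    progress = True
    while pending and progress:
        progress = False
        for u in list(pending):
            v = choice[u]
            if v is not None and v not in label:
                continue                    # predecessor not resolved yet
            label[u] = None if v is None else trans((v, u), label[v])
            pending.remove(u)
            progress = True
    for u in pending:                       # left over only if choices form a loop
        label[u] = None
    return label


def is_real_counterexample(choice, edges: List[Edge], dest, init,
                           trans, prefers) -> bool:
    """True iff every router's chosen route is a best route in the concrete SRP."""
    label = concretize(choice, dest, init, trans)
    for u in choice:
        if u == dest:
            continue
        offers = [trans((v, w), label.get(v)) for (v, w) in edges if w == u]
        offers = [a for a in offers if a is not None]
        if label[u] is None:
            if offers:
                return False                # u had a route available but chose none
        elif any(prefers(a, label[u]) for a in offers):
            return False                    # u ignored a strictly better route
    return True


EDGES = [("d", "a"), ("d", "b"), ("a", "b")]


def trans(edge: Edge, attr: Attr) -> Attr:
    if attr is None:
        return None
    return (200 if edge == ("a", "b") else 100, attr[1] + 1)


def prefers(x, y) -> bool:
    return (-x[0], x[1]) < (-y[0], y[1])


SPURIOUS = {"a": "d", "b": "d"}   # b picks the direct route, ignoring lp 200 via a
REAL = {"a": "d", "b": "a"}
```

`SPURIOUS` is rejected because b would prefer the route with local preference 200 in the concrete network; `REAL` passes the check.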


V. SMT ENCODINGS 


In this section we present our SMT encodings for an 
abstract SRP based on symbolic graphs. SRPs [30] can model 
many distributed routing protocols (e.g., RIP, BGP, etc.) where 
the protocol and configurations determine the partial order 
for route selection and the transfer function. We begin by 
providing definitions for a symbolic graph and its solutions. 


Definition 3 (Symbolic graph [21]). A symbolic graph $G_{RE}$
is a tuple $(G, RE)$ where $G = (V, E)$ is a graph and
$RE = \{ re_{uv} \mid (u, v) \in E \}$ is a set of Boolean routing edge variables.


Definition 4 (Symbolic graph solutions [21]). A symbolic
graph $G_{RE} = (G, RE)$ and a formula $F$ over $RE$ have
solutions $\mathit{Sol}(G_{RE}, F)$, which are subgraphs of $G$ defined
by assignments to $RE$ that satisfy $F$, such that an edge $(u, v)$
is in a solution subgraph iff $re_{uv} = 1$ in the corresponding
satisfying assignment.


A. Routing Constraints on Symbolic Graphs 


We now describe the constraints in our SMT formulation,
$\widehat{N}$, of the abstract SRP $\widehat{S}$. The symbolic graph solutions
$\mathit{Sol}(G_{RE}, \widehat{N})$ correspond to solutions of $\widehat{S}$. The complete
formulation is summarized in Figure 5.


• Routing choice constraints: Each node other than the
  destination chooses a neighbor to route through or None,
  which denotes no route (eqn. 2). We use a variable
  $\mathit{nChoice}$ to denote a node's choice. The routing edge
  $re_{vu}$ is true iff $u$ chooses a route from $v$ (eqn. 3).

• Route availability constraints: If a node $u$ chooses to
  route through a neighbor $v$, then $v$ must have a route
  to the destination (eqn. 5). If every neighbor $v$ either
  has no route ($\neg \mathit{hasRoute}_v$) or the route is dropped
  ($\mathit{routeDropped}_{vu}$), then $u$ must choose None (eqn. 6).

• Attribute transfer and route filtering constraints: If $u$
  chooses to route through neighbor $v$ (i.e., $re_{vu} = 1$), the
  transfer function relates their attributes and $v$'s route must
  not be dropped along edge $(v, u)$ (eqns. 7 and 8). The
  attribute at the destination is the initial route $a_d$ (eqn. 9).

Our formulation is parameterized by three placeholders: (1)
$\mathit{hasRoute}_v$, which is true iff $v$ receives a route from the
destination; (2) $\mathit{trans}_{vu}$, the transfer function along edge $(v, u)$;
and (3) $\mathit{routeDropped}_{vu}$, which is true iff the route is filtered
along the edge $(v, u)$. Of these, $\mathit{trans}_{vu}$ and $\mathit{routeDropped}_{vu}$
depend on the network protocol and configuration, and are
shown in an example below. The encodings of $\mathit{hasRoute}$ are
described in the next subsection.


Example 5 (Transfer constraints). The attribute transfer and
route filtering constraints in the abstract SRP (with partial
order $\prec^*$) are shown below for the network in Figure 1b.

We only model fields used in route filtering (i.e., the
community attribute) and ignore local preference and path
length. We use a bit vector variable $\mathit{comm}_u$ to denote the
community attribute at node $R_u$ and a Boolean routing edge
variable $re_{uv}$ for each edge $(R_u, R_v)$. We encode the presence
of community tag c1 as 1, and its absence as 0.


Initial route at destination. We set the community attribute
to 0 at the destination R1 using the constraint $\mathit{comm}_1 = 0$.


Transfer constraints along edge (R1, R3). The transfer function
adds the community tag c1. The route is never dropped
along this edge, so the placeholder $\mathit{routeDropped}_{13}$ is false.


$$re_{13} \rightarrow \mathit{comm}_3 = 1 \tag{15}$$
$$re_{13} \rightarrow \neg \mathit{routeDropped}_{13} \tag{16}$$
$$\mathit{routeDropped}_{13} \leftrightarrow \mathit{False} \tag{17}$$


Our implementation simplifies formulas when $\mathit{routeDropped}$
is a constant, and only asserts equation (15) above.


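Transfer constraints along edges (R5, R7) and (R6, R7).
The transfer functions propagate the community attribute and
filter routes based on whether tag c1 is present.


$$re_{57} \rightarrow \mathit{comm}_7 = \mathit{comm}_5 \tag{18}$$
$$re_{57} \rightarrow \neg \mathit{routeDropped}_{57} \tag{19}$$
$$\mathit{routeDropped}_{57} \leftrightarrow (\mathit{comm}_5 = 1) \tag{20}$$
$$re_{67} \rightarrow \mathit{comm}_7 = \mathit{comm}_6 \tag{21}$$
$$re_{67} \rightarrow \neg \mathit{routeDropped}_{67} \tag{22}$$
$$\mathit{routeDropped}_{67} \leftrightarrow (\mathit{comm}_6 = 0) \tag{23}$$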


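Transfer constraints along other edges. The transfer functions
propagate the community attribute and do not filter
routes.


$$re_{uv} \rightarrow \mathit{comm}_v = \mathit{comm}_u \tag{24}$$
$$re_{uv} \rightarrow \neg \mathit{routeDropped}_{uv} \tag{25}$$
$$\mathit{routeDropped}_{uv} \leftrightarrow \mathit{False} \tag{26}$$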


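Our implementation simplifies the formulas by substituting the
value of routeDropped, and only asserts equation (24).

Abstract SRP $\widehat{S} = (G, A, a_d, \prec', \mathit{trans})$, $G = (V, E, d)$
Symbolic graph $G_{RE} = (G, RE)$

Variables
  $\mathit{attr}_u$ : bit vector | route announcement fields, $\forall u \in V$
  $\mathit{nChoice}_u$ : bit vector | neighbor choice, $\forall u \in V \setminus \{d\}$
  $\mathit{hasRoute}_u$ : Boolean | placeholder for route availability, $\forall u \in V$
  $\mathit{routeDropped}_{uv}$ : Boolean | route dropped along an edge, $\forall (u, v) \in E$

Constants
  $\mathit{nID}(u, v)$ : integer | $u$'s neighbor ID for $v$, $\forall (u, v) \in E$
  $\mathit{None}_u$ : integer | ID denoting no neighbor, $\forall u \in V$

Routing choice constraints
$$\Big( \bigvee_{(v,u) \in E} \mathit{nChoice}_u = \mathit{nID}(u, v) \Big) \vee \mathit{nChoice}_u = \mathit{None}_u \tag{2}$$
$$\mathit{nChoice}_u = \mathit{nID}(u, v) \leftrightarrow re_{vu} \tag{3}$$
$$\neg re_{ud} \quad \forall (u, d) \in E \tag{4}$$

Route availability constraints
$$\mathit{nChoice}_u = \mathit{nID}(u, v) \rightarrow \mathit{hasRoute}_v \tag{5}$$
$$\mathit{nChoice}_u = \mathit{None}_u \leftrightarrow \bigwedge_{(v,u) \in E} \big( \neg \mathit{hasRoute}_v \vee \mathit{routeDropped}_{vu} \big) \tag{6}$$

Attribute transfer and route filtering constraints
$$re_{vu} \rightarrow \mathit{attr}_u = \mathit{trans}_{vu}(\mathit{attr}_v) \tag{7}$$
$$re_{vu} \rightarrow \neg \mathit{routeDropped}_{vu} \tag{8}$$
$$\mathit{attr}_d = a_d \tag{9}$$

Solver-specific constraints

(a) SMT solvers with graph theory support (e.g., MonoSAT):
$$\forall u \in V,\ \mathit{hasRoute}_u \leftrightarrow G_{RE}.\mathit{reaches}(d, u) \tag{10}$$

(b) SMT solvers without graph theory support (e.g., Z3):
$$\mathit{hasRoute}_d \tag{11}$$
$$\forall u \neq d,\ \mathit{hasRoute}_u \leftrightarrow \bigvee_{v,\,(v,u) \in E} \big( \mathit{hasRoute}_v \wedge re_{vu} \big) \tag{12}$$
$$\mathit{rank}_d = 0 \tag{13}$$
$$\forall (v, u) \in E,\ re_{vu} \rightarrow \mathit{rank}_u = \mathit{rank}_v + 1 \tag{14}$$


Fig. 5: Symbolic graph-based encoding for an abstract SRP.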
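
The routing constraints can be exercised without an SMT solver by enumerating neighbor choices on a tiny network. The sketch below is a brute-force stand-in, not the paper's encoding: the seven-router topology is a hypothetical one loosely modeled on Example 5 (the actual topology of Figure 1b is not reproduced here), only the community value is tracked, and reachability from the destination plays the role of $\mathit{hasRoute}$.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]                       # (v, u): routes flow from v to u

DEST = "R1"
EDGES: List[Edge] = [("R1", "R2"), ("R1", "R3"), ("R2", "R4"), ("R3", "R4"),
                     ("R4", "R5"), ("R4", "R6"), ("R5", "R7"), ("R6", "R7")]
NODES = sorted({n for e in EDGES for n in e})


def transfer(edge: Edge, comm: int) -> int:
    """Eqn. (7): only (R1, R3) adds tag c1; other edges propagate comm."""
    return 1 if edge == ("R1", "R3") else comm


def dropped(edge: Edge, comm: int) -> bool:
    """Route filters, keyed on the sender's community value."""
    if edge == ("R5", "R7"):
        return comm == 1
    if edge == ("R6", "R7"):
        return comm == 0
    return False


def solutions() -> List[Dict[str, Optional[str]]]:
    """Enumerate nChoice assignments satisfying the routing constraints."""
    others = [u for u in NODES if u != DEST]
    preds = {u: [v for (v, w) in EDGES if w == u] for u in others}
    found = []
    for combo in product(*[preds[u] + [None] for u in others]):
        choice = dict(zip(others, combo))
        comm = {DEST: 0}                      # eqn. (9): initial route at d
        changed = True
        while changed:                        # propagate along chosen edges
            changed = False
            for u, v in choice.items():
                if u not in comm and v in comm:
                    comm[u] = transfer((v, u), comm[v])
                    changed = True
        ok = True
        for u, v in choice.items():
            usable = [w for w in preds[u]
                      if w in comm and not dropped((w, u), comm[w])]
            if v is None:
                ok = ok and not usable        # eqn. (6)
            else:
                ok = ok and v in usable       # eqns. (5) and (8)
        if ok:
            found.append(choice)
    return found


SOLS = solutions()
VIOLATIONS = [s for s in SOLS if s["R7"] is None]   # negated reachability of R7
```

In this toy network R4 may route through either R2 or R3, giving two abstract solutions, yet R7 has a route in both because R5 and R6 always carry the same community value. An empty `VIOLATIONS` list corresponds to the unsatisfiable case of the SMT query.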


B. Solver-specific Constraints 


We have two encodings of hasRoute, depending on 
whether the SMT solver has graph theory support. 


SMT solvers with graph theory support. We use the
reachability predicate $G_{RE}.\mathit{reaches}$ to encode $\mathit{hasRoute}$:
$\mathit{hasRoute}_v$ is true iff $G_{RE}.\mathit{reaches}(d, v)$ (i.e., there is a path
from $d$ to $v$ in the symbolic graph $G_{RE}$), where $d$ is the destination
(eqn. 10). Additionally, we use the reachability predicate
to model regular expressions over paths, which most tools do
not support. For example, the regular expression ".*ab.*c.*d.*"
(where '.' matches any character and '*' denotes 0 or more
occurrences of the preceding character) matches any path that
traverses edge $(a, b)$, node $c$, and then node $d$, and is encoded
as $re_{ab} \wedge G_{RE}.\mathit{reaches}(b, c) \wedge G_{RE}.\mathit{reaches}(c, d)$.
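
The path expression above can be evaluated on a concrete set of enabled routing edges. This is a plain-Python sketch with hypothetical names and edge sets; it does not call MonoSAT, and `reaches` is an ordinary graph search standing in for the solver's reachability predicate.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]


def reaches(enabled: Set[Edge], src: str, dst: str) -> bool:
    """Graph search over the enabled edges; a node trivially reaches itself."""
    seen, stack = {src}, [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for (x, y) in enabled:
            if x == node and y not in seen:
                seen.add(y)
                stack.append(y)
    return False


def matches_ab_c_d(enabled: Set[Edge]) -> bool:
    """re_ab AND reaches(b, c) AND reaches(c, d) over the enabled edges."""
    return (("a", "b") in enabled
            and reaches(enabled, "b", "c")
            and reaches(enabled, "c", "d"))


WITH_AB = {("s", "a"), ("a", "b"), ("b", "x"), ("x", "c"), ("c", "d")}
WITHOUT_AB = {("s", "a"), ("a", "x"), ("x", "c"), ("c", "d")}
```

`WITH_AB` satisfies the expression, while `WITHOUT_AB` reaches c and d but never uses edge (a, b), so it does not.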


Standard SMT solvers. We interpret hasRoute as a reach- 
ability marker which indicates whether a route has been 
received and add constraints to propagate the marker in the 
symbolic graph (eqns. 11 and 12). To prevent solutions with 
loops, we use a variable, rank, at each node to track the path 
length along with additional constraints (eqns. 13 and 14). 


Loop prevention. In BGP, routing loops are prevented using 
the AS path attribute, the list of autonomous systems (ASes) 
in the route; routers drop routes if the AS path contains their 
AS. To model BGP's loop prevention mechanism exactly,
Minesweeper's [11] SMT encoding would require $O(N^2)$
additional variables (where $N$ is the number of routers) to
track, for each router, the set of routers in the AS path. Since
this is expensive, Minesweeper uses an optimization that relies 
on the route selection procedure to prevent loops when routers 
use default local preference: the shorter loop-free path will 
be selected. Our encodings for hasRoute model BGP’s loop 
prevention mechanism exactly with fewer additional variables: 
the MonoSAT encoding uses no additional variables and the 
Z3 encoding uses O(N) additional variables (rank). 


C. Benefits of the NRC Abstractions in SMT Solving 


Fewer attributes. The most direct benefit is that with NRC 
abstractions many route announcement fields become irrele- 
vant and can be removed from the network model, resulting 
in smaller SMT formulas. Specifically, all fields required to 
model route filtering (i.e., the dropping of route announce- 
ments) and the property of interest are retained, but fields used 
only for route selection (e.g., local preference) can be removed 
depending on the specific abstraction. 


Expensive transfers can be avoided during SMT search. 
Once a neighbor is selected during the SMT search, then 
transfers of attributes from other neighbors become irrelevant. 
In contrast, without any abstraction, each node must consider 
transfers of attributes from all neighbors to pick the best route. 


D. Encoding Properties for Verification 


Reachability. We encode the property that a node $u$ can reach
destination $d$ by asserting its negation: $\mathit{nChoice}_u = \mathit{None}_u$.

[Panel (a): FatTree topology with a valley-free path to the destination highlighted. Panel (b): valley-free policy, where c is the community attribute (bit vector of width 2):]
  Aggr → Core: if c == 0 then c = 1 else drop route
  Aggr → ToR: if c == 0 then c = 1 else if c == 1 then c = 2 else c = 3
  ToR → Aggr: if c != 0 then drop route

(a) FatTree topology (b) Valley-free policy


Fig. 6: Example data center network with a valley-free policy.


Non-reachability/Isolation. We encode the property that a
node $u$ can never reach $d$ by asserting $\mathit{nChoice}_u \neq \mathit{None}_u$.


No-transit property. Routing policies between autonomous 
systems (ASes) are typically influenced by business relation- 
ships such as provider-customer or peer-peer [19], [36]. A 
provider AS is paid to carry traffic to and from its customers 
while peer ASes exchange traffic between themselves and 
their customers without any charge. The BGP policies (Gao- 
Rexford conditions [19]) between ASes usually ensure that an 
AS does not carry traffic from one peer or provider to another. 
This is called a no-transit property; its negation is encoded as
$\bigvee_{u \in V} \bigvee_{v, w \in \mathit{PeerProv}(u),\, v \neq w} re_{vu} \wedge re_{uw}$, where $\mathit{PeerProv}(u)$
denotes neighbors of $u$ that are its peers or providers.
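
The negated no-transit property is a search for one offending pair of routing edges. The sketch below is illustrative only: the function name, the set-of-edges representation of a solution, and the small AS-level example are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

Edge = Tuple[str, str]            # (v, u): u routes through v


def violates_no_transit(
    routing_edges: Set[Edge],
    peer_prov: Dict[str, Set[str]],
) -> Optional[Tuple[str, str, str]]:
    """Return a witness (v, u, w) with re_vu and re_uw for distinct v, w in PeerProv(u)."""
    for u, nbrs in peer_prov.items():
        for v in sorted(nbrs):
            for w in sorted(nbrs):
                if v != w and (v, u) in routing_edges and (u, w) in routing_edges:
                    return (v, u, w)
    return None


# Hypothetical AS-level example: p1 and p2 are providers of u; u is a provider of c.
PEER_PROV = {"u": {"p1", "p2"}, "c": {"u"}, "p1": set(), "p2": set()}
OK_EDGES = {("p1", "u"), ("u", "c")}        # provider route passed to a customer
BAD_EDGES = {("p1", "u"), ("u", "p2")}      # provider route passed to a provider
```

`violates_no_transit(BAD_EDGES, PEER_PROV)` returns the witness `("p1", "u", "p2")`, whereas `OK_EDGES` yields `None`.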


Policy properties. BGP policies can be defined by assigning 
meaning to specific community tags. Policy properties can then 
be encoded using formulas over the communities at a node. 


Example 6 (Valley-free Policy). The valley-free policy pre- 
vents paths that have valleys, i.e., paths which go up, down, 
and up again between the layers of a FatTree network topol- 
ogy [23], [29]; a valley-free path is shown in Figure 6a. 
Figure 6b shows an implementation of the valley-free policy 
where c denotes the community attribute in BGP. A path 
between ToR routers with a valley between the Aggr and Core 
layers will cross an Aggr router at least three times, updating 
c to 3. Hence, the negation of the valley-free property at a
node $u$ is encoded as $\mathit{comm}_u = 3$.


VI. IMPLEMENTATION AND EVALUATION 


We implemented our abstractions and SMT encodings in 
a prototype tool called ACORN, with backends to MonoSAT 
and Z3 solvers. (The SMT encoding for an abstract SRP (§V)
is extended for a concrete SRP using additional constraints
described in Appendix D.) ACORN’s input is an intermediate 
representation (IR) of a network topology and configurations 
(described in Appendix C) which represents routing policy 
using match-action rules, similar to route-maps in Cisco’s con- 
figuration language, and could serve as a target for frontends 
such as Batfish [14] or NV [13] in the future. 

In our evaluation, we measure the effectiveness of the NRC
abstractions and use two backend SMT solvers: MonoSAT
and Z3 (with bitvector theory and bit-blasting enabled). We use
four settings: (1) abs_mono: with NRC abstraction ($\prec^*$), using
MonoSAT; (2) abs_z3: with NRC abstraction ($\prec^*$), using
Z3; (3) mono: without abstraction, using MonoSAT; (4) z3:
without abstraction, using Z3. We evaluated ACORN on two
types of benchmarks: (1) data center networks with FatTree 
topologies [23] (a commonly used topology), and (2) wide area 
networks from Topology Zoo [24] and BGPStream [25] (more 
details are in Appendix C). We also compared ACORN with 
two state-of-the-art control plane verifiers on the data center 
benchmarks. All experiments were run on a Mac laptop with 
a 2.3 GHz Intel i7 processor and 16 GB memory. 


A. Data Center Networks 


We generated data center network benchmarks with FatTree 
topologies [23], with 125 to 36,980 nodes running four poli- 
cies: (1) shortest-path routing policy, (2) valley-free policy, 
(3) an extension of the valley-free policy with an isolation 
property — it uses regular expressions to enforce isolation 
between a FatTree pod and an external router connected to 
the core routers, and (4) a buggy valley-free policy in which 
routers in the last pod cannot reach routers in other pods. 
We checked reachability for all policies, and a policy-based 
property for (2) and (3). The results are shown in Figure 7, 
with each graph showing the number of nodes on the x-axis 
and the verification time (in seconds) on the y-axis. 

Our results show that for all data center examples, and with 
both solvers, using the NRC abstraction is uniformly better 
than using the no-abstraction setting. With the MonoSAT 
solver, the NRC abstraction can achieve a relative speed-up 
of 52x for verifying reachability (when verification completes 
within a 1 hour timeout). Also, MonoSAT performed better 
than Z3 by up to 10x; leveraging graph-based reasoning was 
clearly beneficial for these examples. Our abstract settings 
successfully verified all properties without any false positives, 
showing that the NRC abstraction can handle realistic policies. 
For networks running the buggy valley-free policy, our tool 
correctly reports that the destination is unreachable (results are 
in Figure 7f). Furthermore, our abstraction is effective even in 
these cases: abs_mono finishes on 3,000 nodes within an hour, 
while both no-abstraction settings time out on 2,000 nodes. 

In terms of scalability, for both solvers, the no-abstraction 
setting times out beyond 4,500 nodes for reachability verifica- 
tion, while the abstract setting scales up to about 37,000 nodes 
for the shortest-path and valley-free policies, and up to 18,000 
nodes for the isolation policy. To the best of our knowledge, 
no other control plane verifier has shown the correctness of 
benchmarks of such large sizes; all prior related work has been 
shown on networks with up to 4,500 nodes (maximum), which 
are much smaller than large data centers in operation today. 


B. Wide Area Networks 


To evaluate ACORN on less regular network topologies than 
data centers, we considered wide area network benchmarks. 
These typically have small sizes and are not easily parameter- 
ized, unlike data center topologies. We evaluated ACORN on 
two sets of wide area networks: (1) 10 of the larger networks 
from Topology Zoo [24], with 22 to 79 routers, which we 
annotated with business relationships (since Topology Zoo 
only provides topologies), and (2) 10 example networks based
on parts of the Internet that were involved in misconfiguration
incidents as reported on BGPStream [25], which we annotated 
with publicly available business relationships (CAIDA AS 
relationships dataset [37]). For all benchmarks, we used a BGP 
policy that implements the Gao-Rexford conditions [19]: (1) 
routes from peers and providers are not exported to other peers 
and providers, and (2) routes from customers are preferred 
over routes from peers, which are preferred over routes from 
providers. We then checked two properties: reachability of all 
nodes to a destination, and the no-transit property (§V-D). 


Topology Zoo benchmarks. The abstract settings successfully 
verify both properties and are up to 3x faster than the respec- 
tive no-abstraction settings. All settings take less than 0.5s for 
both properties (detailed results are in Appendix C). 


BGPStream benchmarks. The results are in Figures 7g to 7j, 
with the number of nodes (ASes) on the x-axis and verification 
time in seconds on the y-axis (log scale). The abstract settings 
successfully verified reachability in 6 networks and gave 
false positives (denoted by triangular markers) for 4; when 
successful, the abstract settings performed much better than 
the no-abstraction settings with relative speedups of up to 
323x for MonoSAT and 3x for Z3. For the no-transit property, 
abs_mono is up to 120x faster than mono, while abs_z3 is 
faster than z3 for some networks but slower for others. 

For the 4 benchmarks with false positives, we used a
more precise abstraction, $\prec_{(lp)}$, which models local preference
(results shown in Figures 7i and 7j). Our $\prec_{(lp)}$ abstraction is
successful on all 4 networks, with relative speedups (over no
abstraction) of up to 133x for MonoSAT and 1.8x for Z3,
and relative slowdowns (over $\prec^*$) of up to 2.7x for MonoSAT
and 1.5x for Z3. These results demonstrate the precision-cost
tradeoff enabled by the NRC abstraction hierarchy.


C. Comparison with Existing Tools 


Fig. 8: Comparison of tools on data center examples.


We compared ACORN with two state-of-the-art control
plane verifiers: ShapeShifter [16] and NV [13] (FastPlane [15]
and Hoyan [18] are not publicly available). ShapeShifter
uses simulation with abstract interpretation [38], with binary 
decision diagrams (BDDs) [39] representing sets of abstract 
routing messages. NV is a functional programming language 
for modeling and verifying network control planes. It provides 
a simulator (based on Multi-Terminal BDDs [40] but without 
abstraction of routing messages) and an SMT-based verifier 
that uses Z3. (NV’s SMT engine has been shown to perform 
better than Minesweeper [13].) NV uses a series of front-end 
transformations to generate an SMT formula (we only report 
NV’s SMT solving time), but its encoding is not based on 
symbolic graphs. A comparison of our no-abstraction settings 
against NV_SMT gives some indication of the effectiveness 
of our SMT encoding. We performed experiments on the data
center benchmarks (§VI-A), where we generated corresponding
inputs for ShapeShifter and NV with the same routing
message fields. The results for the shortest-path routing and 
valley-free policies are shown in Figure 8, with the number 
of nodes shown on the x-axis, verification time in seconds on 
the y-axis (log scale), timeouts denoted by ‘x’, and out-of- 
memory denoted by ‘OOM’. (ShapeShifter and NV could not 
be run on the isolation benchmarks as they do not support 
regular expressions over AS paths.) Note that both NV and 
ShapeShifter run out of memory for networks with more than 
3,000 nodes while ACORN's mono and abs_mono settings can
verify larger networks with 4,500 nodes and 36,980 nodes,
respectively. These results show that SMT-based methods for 
network control plane verification can scale to large networks 
with tens of thousands of nodes. 


D. Discussion and Limitations 


ACORN is sound for properties that hold for all stable states
of a network, i.e., properties of the form $\forall s.\ P(s)$ where $s$ is
a stable state, such as reachability, policy-based properties, 
device equivalence, and way-pointing. Like many SMT-based 
tools, ACORN cannot verify properties over transient states 
that arise before convergence. For checking reachability, our 
least precise abstraction works well in practice; to verify a 
property about the path length between two routers, a user 
should use an abstraction that models path length (otherwise 
our verification procedure would give a false positive). We 
have shown that our abstractions are sound under specified 
failures; however, our tool does not yet model failures, which 
we plan to consider in future work. 


VII. RELATED WORK 


Our work is related to other efforts in network verification 
and the use of nondeterministic abstractions for verification. 


Distributed control plane verification. These methods [41], 
[12], [42], [17], [11], [13] aim to verify all data planes that 
emerge from the control plane. Simulation-based tools [14], 
[43], [15] can scale to large networks, but can miss errors 
that are triggered only under certain environments. The FAST- 
PLANE [15] simulator scales to large data centers (results 
shown for ~2000 nodes) but it requires the network policy 
to be monotonic [31] (a route announcement’s preference 
decreases along any edge in the network) while our approach 
does not. HOYAN [18] uses a hybrid simulation and SMT- 
based approach which tracks multiple routes received at each 
router to check reachability under failures, but in the context of 
the given simulation. The ShapeShifter [16] work is the closest 
to ours in terms of route abstractions, but it does not scale as 
well as our tool (§VI-C). Moreover, our SMT-based approach 
provides better precision by exploring multiple routing choices 
at each node and tracking correlations across different nodes, 
whereas ShapeShifter uses a conservative abstraction at each 
node, much as SMT-based program verification allows path- 
sensitivity for more precision than path-insensitive static anal- 
ysis. For example, ShapeShifter’s ternary abstraction (which 
abstracts each community tag bit to {0, 1, x}) would result in 
a false positive on Example 2 ($ID), while ACORN verifies it 
correctly. Bagpipe [12] verifies BGP policies using symbolic 
execution and uses a simplified BGP route selection procedure 
that chooses routes with maximum local preference, similar to 
our NRC abstraction using $\prec_{(lp)}$. Our abstraction hierarchy is
more general and can be applied to any routing protocol. 
ARC [44] and QARC [45] use a graph-based abstraction 
combined with graph algorithms and mixed-integer linear 
programming respectively, but do not support protocol features 
such as local preference and community tags. Tiramisu [46] 
uses a similar graph-based representation, but with multiple
layers to capture inter-protocol dependencies and was shown
to scale to networks with a few hundred devices. Bonsai [30] 
compresses the network control plane to take advantage of 
symmetry in the network topology and policy; NRC abstrac- 
tions can be applied even when the network is not symmetric. 

Some recent approaches [47], [48], [20] use modular ver- 
ification techniques to improve the scalability of verification; 
the core ideas in modular verification are orthogonal to our 
work. Among these efforts, LIGHTYEAR [20] also verifies 
BGP policies using an over-approximation that allows routers 
to choose any received route; this corresponds to our NRC abstraction
with partial order $\prec^{*}$. However, unlike our approach,
it requires a user to provide suitable invariants. 


Data plane verification. These efforts [49], [50], [51], [52], 
[53], [54], [55], [56] model the data forwarding rules and 
check properties such as reachability, absence of routing loops, 
etc. Many such methods have been shown to successfully han- 
dle the scale and complexity of real-world networks. Similar 
to these methods, our least precise abstraction does not model 
the route selection procedure but we verify all data planes that 
emerge from the control plane, not just one snapshot. 


Nondeterminism and abstractions. Nondeterministic ab- 
stractions have been used in many different settings in software 
and hardware verification. Examples include control flow non- 
determinism in Boolean program abstractions in SLAM [57], 
a sequentialization technique [58] that converts control nonde- 
terminism (i.e., interleavings in a concurrent program) to data 
nondeterminism, and a localization abstraction [59] in hard- 
ware designs. Our NRC abstractions use route nondeterminism 
to soundly abstract network control plane behavior. 


VIII. CONCLUSIONS AND FUTURE DIRECTIONS 


The main motivation for our work is to improve the scal- 
ability of symbolic verification of network control planes. 
Our approach is centered around two core contributions: a 
hierarchy of nondeterministic routing choice abstractions, and 
a new SMT encoding that can leverage specialized SMT 
solvers with graph theory support. Our tool, ACORN, has 
verified reachability (an important property for network op- 
erators) on data center benchmarks (with FatTree topologies 
and commonly used policies) with ~37,000 routers, which far 
exceeds what has been shown by existing related tools. Our 
evaluation shows that our abstraction performs uniformly bet- 
ter than no abstraction for verifying reachability for different 
network topologies and policies, and with two different SMT 
solvers. In future work, we plan to consider verification under 
failures, and combine our abstractions with techniques based 
on modular verification of network control planes. 
APPENDIX A
BGP OVERVIEW 


BGP is the protocol used for routing between autonomous 
systems (ASes) in the Internet. An autonomous system (AS) 
is a network controlled by a single administrative entity, 
e.g., the network of an Internet Service Provider (ISP) in a 
particular country, or a college campus network. A simplified 
version of the decision process used to select best routes in 
BGP is shown in Table I [36]. A router compares two route 
announcements by comparing the attributes in each row of the 
table, starting from the first row. A route announcement with 
higher local preference is preferred, regardless of the values 
of other attributes; if two route announcements have equal 
local preference, then their path lengths will be compared. 
BGP allows routes to be associated with additional state via 
the community attribute, a list of string tags. Decisions can 
be taken based on the tags present in a route announcement; 
for example, a route announcement containing a particular tag 
can be dropped or the route preference can be altered (e.g., by 
increasing the local preference if a particular tag is present). 
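
The row-by-row comparison described above can be sketched as a lexicographic ordering. This is only an illustration of the simplified decision process in Table I; the class name `Route` and its field names are hypothetical and are not taken from any BGP implementation or from ACORN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    # Hypothetical field names, one per row of the simplified table.
    local_pref: int      # step 1: higher is preferred
    path_len: int        # step 2: lower is preferred
    med: int = 0         # step 3: lower is preferred
    router_id: int = 0   # step 4: lower is preferred (tie breaking)


def preference_key(route: Route):
    # Negate local_pref so that a smaller key always means "more preferred";
    # tuple comparison then walks the rows in order, as the text describes.
    return (-route.local_pref, route.path_len, route.med, route.router_id)


def best_route(routes):
    """Return the most preferred route under the simplified decision process."""
    return min(routes, key=preference_key)
```

With this ordering, a route with local preference 200 and path length 5 is preferred over one with local preference 100 and path length 1, because path length is consulted only when local preferences are equal.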


APPENDIX B 
PROOF OF SOUNDNESS OF THE NRC ABSTRACTIONS 


Lemma 1. [Over-approximation] For an SRP $S$ and corresponding
abstract SRP $\hat{S}_{\prec'}$ with solutions $Sol(S)$ and
$Sol(\hat{S}_{\prec'})$ respectively, $Sol(S) \subseteq Sol(\hat{S}_{\prec'})$.


Proof. We need to show that for each labeling $\mathcal{L}$, if
$\mathcal{L} \in Sol(S)$ then $\mathcal{L} \in Sol(\hat{S}_{\prec'})$.
An SRP solution $\mathcal{L}$ is defined by

$$\mathcal{L}(u) = \begin{cases}
a_d & \text{if } u = d \\
\infty & \text{if } attrs_{\mathcal{L}}(u) = \emptyset \\
a \in attrs_{\mathcal{L}}(u), \text{ minimal by } \prec & \text{if } attrs_{\mathcal{L}}(u) \neq \emptyset
\end{cases}$$

where $attrs_{\mathcal{L}}(u)$ is the set of attributes that $u$ receives from its
neighbors. The abstract SRP $\hat{S}_{\prec'}$ differs from the SRP $S$ only
in the partial order. Therefore, to show that $\mathcal{L}$ is a solution
of $\hat{S}_{\prec'}$, we need to show that if $attrs_{\mathcal{L}}(u) \neq \emptyset$, then $\mathcal{L}(u)$ is
minimal by $\prec'$. By the definition of an abstract SRP, the set
of minimal attributes according to $\prec'$ is a superset of the set
of minimal attributes according to $\prec$, which means $\mathcal{L}(u)$ is
minimal by $\prec'$. Therefore, any SRP solution $\mathcal{L}$ is a solution
of the abstract SRP $\hat{S}_{\prec'}$.


Theorem 1. [Soundness] Given SMT formulas $\hat{N}$ and $N$
modeling the abstract and concrete SRPs respectively and
SMT formula $P$ encoding the property to be verified, if
$\hat{N} \wedge \neg P$ is unsatisfiable, then $N \wedge \neg P$ is also unsatisfiable.


Proof. If $\hat{N} \wedge \neg P$ is unsatisfiable, every solution of the abstract
SRP satisfies the given property. By Lemma 1, the property
also holds for all solutions of the concrete SRP $S$, i.e., there
is no property violation in the real network.
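
The key step of Lemma 1, that a coarser order can only enlarge the set of minimal attributes, can be illustrated on a toy example. The sketch below assumes attributes are `(local_pref, path_len)` pairs and uses two hypothetical orders: a concrete one that compares local preference and then path length, and a coarser one that compares local preference only. Neither is claimed to be ACORN's actual encoding.

```python
def minimal(attrs, strictly_better):
    """Attributes in `attrs` that no other attribute strictly beats."""
    return {a for a in attrs
            if not any(strictly_better(b, a) for b in attrs)}


# Attributes are (local_pref, path_len) pairs (an assumption for this sketch).
def concrete_better(a, b):
    # Higher local preference wins; ties are broken by shorter path length.
    return a[0] > b[0] or (a[0] == b[0] and a[1] < b[1])


def abstract_better(a, b):
    # Coarser order: local preference only, so more attributes stay minimal.
    return a[0] > b[0]


received = {(100, 2), (200, 4), (200, 3)}
concrete_choices = minimal(received, concrete_better)
abstract_choices = minimal(received, abstract_better)
```

Here the concrete order admits only `(200, 3)`, while the coarser order also admits `(200, 4)`, so every concrete choice remains available in the abstraction.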


APPENDIX C 
ACORN INTERMEDIATE REPRESENTATION (IR) AND 
BENCHMARK EXAMPLES 
c: community attribute (bit vector of width 2)
c = 0: Customer, c = 1: Peer, c = 2: Provider

Provider -> Customer:  c = 2; lp = 100
Customer -> Provider:  if c != 0 then drop route else c = 0; lp = 300
Peer -> Peer:          if c != 0 then drop route else c = 1; lp = 200

Fig. 9: BGP policy implementing Gao-Rexford conditions [19]


[Plots: x-axis is network size (# nodes) for sizes 22, 23, 27, 32, 36, 41, 42, 47, 70, 79; series are ACORN abs_mono, ACORN mono, ACORN abs z3, ACORN z3.]

(a) Reachability (b) No-transit property

Fig. 10: Results for Topology Zoo examples.


Intermediate Representation (IR). Our IR represents a
transfer function as a list of match-action rules, similar to
route-maps in Cisco’s configuration language. We support
matching on the community attribute and some types of regular
expressions over the AS path. Our implementation currently
supports regular expressions that check whether the path
contains certain ASes or a particular sequence of ASes, and
could be extended to support general regular expressions in the
future. A match can be associated with multiple actions, which
can update route announcement fields such as the community
attribute, local preference, and AS path length.
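
A match-action transfer function of the kind described above might look as follows. This is a sketch under stated assumptions, not ACORN's IR: the names `Route`, `Rule`, and `apply_transfer` are hypothetical, first-match semantics and "drop when nothing matches" are assumed, and the example rules follow one reading of the customer-to-provider edge in Figure 9.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Route:
    community: int   # 0 = Customer, 1 = Peer, 2 = Provider (as in Figure 9)
    local_pref: int
    path_len: int


@dataclass(frozen=True)
class Rule:
    match: Callable[[Route], bool]
    actions: tuple   # each action maps a Route to a Route, or to None (drop)


def apply_transfer(rules, route: Route) -> Optional[Route]:
    """Run the actions of the first matching rule (assumed semantics)."""
    for rule in rules:
        if rule.match(route):
            out = route
            for action in rule.actions:
                out = action(out)
                if out is None:      # the route was dropped
                    return None
            return out
    return None                      # assumption: unmatched routes are dropped


def drop(route):
    return None


def set_fields(**fields):
    return lambda route: replace(route, **fields)


def inc_path(route):
    return replace(route, path_len=route.path_len + 1)


# Customer -> Provider edge: only routes learned from customers are exported.
customer_to_provider = [
    Rule(lambda r: r.community != 0, (drop,)),
    Rule(lambda r: True, (set_fields(community=0, local_pref=300), inc_path)),
]
```

A rule list keeps the policy declarative: each edge of the network gets its own list, and one matched rule may carry several actions, mirroring the text's statement that a match can be associated with multiple actions.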


Benchmark examples. The details of the wide area network 
examples we used (§VI-B) are described below. 

Topology Zoo benchmarks. We used 10 topologies from the 
Topology Zoo [24], which we pre-processed, e.g., by removing 
duplicate nodes and nodes with id “None”. The details of the 
resulting topologies are shown in Table II. We annotated the 
topologies with business relationships, considering each node 
as an AS, and used a BGP policy that implements the Gao- 
Rexford conditions [19] (Figure 9). The annotated benchmark 
files (in GML format) are included in our benchmark reposi- 
tory, along with the examples in our IR format. 

BGPStream benchmarks. We created a set of 10 examples 
based on parts of the Internet involved in BGP hijacking 
incidents, as reported on BGPStream [25]. For each hijacking 
incident, we created a network with the ASes involved and 
used the CAIDA AS Relationships dataset [37] to add edges 
between ASes with the given business relationships (customer- 
provider or peer-peer). We then removed some ASes (if 
required) so that our no-abstraction setting could verify that all 
ASes in the resulting network can reach the destination (taken 
to be the possibly hijacked AS). We used a BGP policy (shown 
in Figure 9) that implements the Gao-Rexford conditions [19]. 
The details of the examples are shown in Table III.


Results for Topology Zoo examples. Detailed results for the 
Topology Zoo benchmark examples are shown in Figure 10. 
Step | Attribute                     | Description                                                       | Preference (Lower/Higher)
1    | Local preference              | An integer set locally and not propagated                         | Higher
2    | AS path length                | The number of ASes the route has passed through                   | Lower
3    | Multi-exit Discriminator (MED) | An integer influencing which link should be used between two ASes | Lower
4    | Router ID                     | Unique identifier for a router used for tie breaking              | Lower

TABLE I: Simplified BGP decision process to select the best route [36].


Benchmark | Topology name | Size
TZ1       | TN            | 22 nodes, 24 edges
TZ2       | FCCN          | 23 nodes, 25 edges
TZ3       | GTS Hungary   | 27 nodes, 28 edges
TZ4       | GTS eae.      | 32 nodes, 34 edges
TZ5       | GRnet         | 36 nodes, 41 edges
TZ6       | RoEduNet      | 41 nodes, 45 edges
TZ7       | LITNET        | 42 nodes, 42 edges
TZ8       | Bell South    | 47 nodes, 62 edges
TZ9       | Tecove        | 70 nodes, 70 edges
TZ10      | ULAKNET       | 79 nodes, 79 edges

TABLE II: Topology Zoo examples.


Benchmark | Incident date | Size
B1        | 2021-06-14    | 261 nodes, 3325 edges
B2        | 2021-06-17    | 223 nodes, 2722 edges
B3        | 2021-06-18    | 133 nodes, 1205 edges
B4        | 2021-06-19    | 210 nodes, 2100 edges
B5        | 2021-06-21    | 269 nodes, 3351 edges
B6        | 2021-06-22    | 212 nodes, 2233 edges
B7        | 2021-06-22    | 294 nodes, 4108 edges
B8        | 2021-06-22    | 124 nodes, 860 edges
B9        | 2021-06-22    | 73 nodes, 270 edges
B10       | 2021-06-25    | 154 nodes, 1176 edges

TABLE III: BGPStream examples.


APPENDIX D
SMT CONSTRAINTS FOR CONCRETE SRP

We extend our abstract SRP formulation (Figure 5) to
encode a concrete SRP by adding constraints ensuring
that each node picks the best route, i.e., for every edge
$(v, u) \in E$, if $u$ selects the route from $v$ then $v$’s route must be
the best route that $u$ receives from its neighbors. This requires
keeping track of the attribute fields used in route selection
(such as path length) and possibly additional variables to
track the minimum or maximum value of an attribute. The
constraints required to model the first two steps in BGP’s route
selection procedure are shown in Example 7.


Example 7 (Encoding route selection in BGP). We keep track
of local preference (denoted $lp$) and AS path length (denoted
$path$) and encode transfer constraints over these attributes
(e.g., to increment path length). For each edge $(v, u)$, we use
$lp_{vu}$ to denote the local preference of the route sent from $v$
to $u$ after applying the transfer function. For each node $u$ we
use $maxLp_u$ to track the maximum local preference of routes
node $u$ receives, and $minPath_u$ to track the minimum path
length among routes with the maximum local preference.


We define $maxLp_u$ below ($nValid_{vu} \leftrightarrow hasRoute_v \wedge
\neg routeDropped_{vu}$ indicates whether $v$ sends a route to $u$).

$$\bigwedge_{(v,u) \in E} nValid_{vu} \rightarrow maxLp_u \geq lp_{vu}$$

$$nChoice_u \neq None_u \rightarrow \bigvee_{(v,u) \in E} nValid_{vu} \wedge maxLp_u = lp_{vu}$$

We define $minPath_u$ using similar constraints:

$$\bigwedge_{(v,u) \in E} (nValid_{vu} \wedge lp_{vu} = maxLp_u) \rightarrow minPath_u \leq path_{vu}$$

$$nChoice_u \neq None_u \rightarrow \bigvee_{(v,u) \in E} nValid_{vu} \wedge lp_{vu} = maxLp_u \wedge minPath_u = path_{vu}$$

We then add constraints to ensure that if $u$ chooses a route
from any neighbor $v$, then $v$’s route must be the best.

$$nChoice_u = nID(u, v) \rightarrow lp_{vu} = maxLp_u \wedge path_{vu} = minPath_u$$
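
To see what these constraints pin down, the following sketch computes the same quantities directly for a single node. It is not an SMT encoding: the function names and the dictionary representation of received routes are assumptions made for illustration, and only routes for which $nValid_{vu}$ holds are assumed to appear in the input.

```python
def best_route_bounds(received):
    """received: dict mapping neighbor v -> (lp_vu, path_vu) for valid routes.

    Returns (maxLp_u, minPath_u): the maximum local preference, and the
    minimum path length among routes that attain that local preference.
    """
    max_lp = max(lp for lp, _ in received.values())
    min_path = min(path for lp, path in received.values() if lp == max_lp)
    return max_lp, min_path


def allowed_choices(received):
    """Neighbors v with lp_vu = maxLp_u and path_vu = minPath_u."""
    if not received:
        return set()          # no valid route, so no neighbor can be chosen
    max_lp, min_path = best_route_bounds(received)
    return {v for v, (lp, path) in received.items()
            if lp == max_lp and path == min_path}
```

In the SMT formulation the same values are characterized declaratively: the conjunctions bound $maxLp_u$ and $minPath_u$, and the disjunctions force each bound to be attained by some valid route.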
Abstract—We present a new technique for automatically inferring
inductive invariants of parameterized distributed protocols
specified in TLA+. Ours is the first such invariant inference
technique to work directly on TLA+, an expressive, high level
specification language. To achieve this, we present a new
algorithm for invariant inference that is based around a core
procedure for generating plain, potentially non-inductive lemma
invariants that are used as candidate conjuncts of an overall
inductive invariant. We couple this with a greedy lemma invariant
selection procedure that selects lemmas that eliminate the largest
number of counterexamples to induction at each round of our
inference procedure. We have implemented our algorithm in a
tool, endive, and evaluate it on a diverse set of distributed protocol
benchmarks, demonstrating competitive performance and ability
to uniquely solve an industrial scale reconfiguration protocol.


I. INTRODUCTION 


Automatically verifying the safety of distributed systems 
remains an important and difficult challenge. Distributed pro- 
tocols such as Paxos [32] and Raft [39] serve as the foundation 
of modern fault tolerant systems, making the correctness of 
these protocols critical to the reliability of large scale database, 
cloud computing, and other decentralized systems [47], [8], 
[11], [38]. An effective approach for reasoning about the cor- 
rectness of these protocols involves specifying system invari- 
ants, which are assertions that must hold in every reachable 
system state. Thus, a primary task of verification is proving 
that a candidate invariant holds in every reachable state of 
a given system. For adequately small, finite state systems, 
symbolic or explicit state model checking techniques [12], 
[26], [6] can be sufficient to automatically prove invariants. 
For verification of infinite state or parameterized protocols, 
however, model checking techniques may, in general, be 
incomplete [7]. Thus, the standard technique for proving that 
such a system satisfies a given invariant is to discover an 
inductive invariant, which is an invariant that is typically 
stronger than the desired system invariant, and is preserved 
by all protocol transitions. Discovering inductive invariants, 
however, is one of the most challenging aspects of verification 
and remains a non-trivial task with a large amount of human 
effort required [50], [13], [49], [44]. Thus, automating the 
inference of these invariants is a desirable goal. 
In general, the problem of inferring inductive invariants for
infinite state protocols is undecidable [40]. Even the verifica- 
tion of inductive invariants may require checking the validity 
of arbitrary first order formulas, which is undecidable [41]. 
Thus, this places fundamental limits on the development of 
fully general algorithmic techniques for discovering inductive 
invariants. 

Significant progress towards automation of inductive in- 
variant discovery for infinite state protocols has been made 
with the Ivy framework [42]. Ivy utilizes a restricted sys- 
tem modeling language that allows for efficient checking of 
verification goals via an SMT solver such as Z3 [17]. In 
particular, the EPR and extended EPR subsets of Ivy are 
decidable. Ivy also provides an interface for an interactive, 
counterexample guided invariant discovery process. The Ivy 
language, however, may place an additional burden on users 
when protocols or their invariants don’t fall naturally into one 
of the decidable fragments of Ivy. Transforming a protocol 
into such a fragment is a manual and nontrivial task [41]. 

Subsequent work has attempted to fully automate the dis- 
covery of inductive invariants for distributed protocols. State 
of the art tools for inductive invariant inference for distributed 
protocols include I4 [35], fol-ic3 [29], IC3PO [24], SWISS
[25], and DistAI [51]. All of these tools, however, accept only 
Ivy or an Ivy-like language [2] as input. Moreover, several of 
these tools work only within the restricted decidable fragments 
of Ivy. 

In this paper, we present a new technique for automatic 
discovery of inductive invariants for protocols specified in 
TLA+, a high level, expressive specification language [33]. To
our knowledge, this is the first inductive invariant discovery 
tool for distributed protocols in a language other than Ivy. 
Our technique is built around a core procedure for generating 
small, plain (potentially non-inductive) invariants. We search 
for these invariants on finite protocol instances, employing the 
so-called small scope hypothesis [27], [35], [4], circumventing 
undecidability concerns when reasoning over unbounded do- 
mains. We couple this invariant generation procedure with an 
invariant selection procedure based on a greedy counterexam- 
ple elimination heuristic in order to incrementally construct 
an overall inductive invariant. By restricting our inference 
reasoning to finite instances, we avoid restrictions imposed 
by modeling approaches that try to maintain decidability of 
SMT queries.

Our technique is partially inspired by prior observations 
[13], [44], [25], [10] that, for many practical protocols, an 
inductive invariant $I$ is typically of the form
$I = P \wedge A_1 \wedge \cdots \wedge A_n$, where $P$ is the main invariant (i.e. safety
property) we are trying to establish, and $A_1, \ldots, A_n$ are a
list of lemma invariants. Each lemma invariant $A_i$ may not
necessarily be inductive, but it is necessarily an invariant, and
it is typically much smaller than $I$. These lemma invariants
serve to strengthen P so as to make it inductive. Many prior 
approaches to inductive invariant inference have focused on 
searching for lemma invariants that are inductive, or inductive 
relative to previously discovered information [25], [10], [24], 
[29]. In contrast, our inference procedure searches for plain 
lemma invariants and uses them as candidates for conjuncts of 
an overall inductive invariant. To search for lemma invariants, 
we sample candidates using a syntax-guided approach [20], 
and verify the candidates using an off the shelf model checker. 

We have implemented our invariant inference procedure in 
a tool, endive, and we evaluate its performance on a set of 
diverse protocol benchmarks, including 29 of the benchmarks 
reported in [24]. Our tool solves nearly all of these bench- 
marks, and compares favorably with other state of the art tools, 
despite the fact that all of these tools accept Ivy or decidable 
Ivy fragments as inputs. We also evaluate our tool and other 
state of the art tools on a more complex, industrial scale 
protocol, MongoLoglessDynamicRaft (MLDR) [44]. MLDR 
performs dynamic reconfiguration in a Raft based replication 
system. Our tool is the only one which manages to find a 
correct inductive invariant for MLDR. 

To summarize, in this paper we make the following contri- 
butions: 

• A new technique for inductive invariant inference that
  works for distributed protocols specified in TLA+.

• A tool, endive, which implements our inductive invariant
  inference algorithm. To our knowledge, this is the only
  existing tool that works directly on TLA+.

• An experimental evaluation of our tool on a diverse set
  of distributed protocol benchmarks.

• The first, to our knowledge, automatic inference of an
  inductive invariant for an industrial scale Raft-based
  reconfiguration protocol.

The rest of this paper is organized as follows. Section II 
presents preliminaries and a formal problem statement. Sec- 
tion III describes our algorithm for inductive invariant infer- 
ence, along with more details on our technique. Section IV 
provides an experimental evaluation of our algorithm, as 
implemented in our tool, endive. Section V examines related 
work, and Section VI presents conclusions and goals for future 
work. 


II. PRELIMINARIES AND PROBLEM STATEMENT 


1) TLA+: Throughout the rest of this paper, we adopt
the notation of TLA+ [33] for formally specifying systems
and their correctness properties. TLA+ is an expressive, high
level specification language for specifying distributed and
concurrent protocols. It has also been used effectively in
industry for specifying and verifying correctness of protocol
designs [5], [38]. Note that our tool accepts models written in
TLA+. Figure 1 describes a simple lock server protocol [42],
[49] in TLA+ which we will use as a running example.


2) Symbolic Transition Systems: The protocols considered 
in this paper can be modeled as parameterized symbolic 
transition systems (STSs), like the one shown in Figure 1. 
This STS is parameterized by two sorts, called Server and 
Client (Line 1). Each sort represents an uninterpreted constant 
symbol that can be interpreted as any set of values. In this 
paper we assume that sorts may only be interpreted over finite 
domains of distinct values, e.g., $Server = \{a_1, \ldots, a_k\}$ and
$Client = \{c_1, \ldots, c_l\}$.

In addition to types, an STS also has a set of state variables.
A state is an assignment of values to all state variables. We
use the notation $s \models P$ to denote that state $s$ satisfies state
predicate $P$, i.e., that $P$ evaluates to true once we replace all
state variables in $P$ by their values as given by $s$.

The STS of Figure 1 has two state variables, called locked 
and held (Line 2). The state predicate Init specifies the possible 
values of the state variables at an initial state of the system 
(Lines 3-5). Init states that initially locked[i] is TRUE for all
$i \in Server$, and that held[i] is {} (the empty set) for all
$i \in Client$. The predicate Next defines the transition relation
of the STS (Lines 14-16). In TLA+, Next is typically written
as a disjunction of actions, i.e., possible symbolic transitions.
In the example of Figure 1 there are two possible symbolic
transitions: either some client c and some server s engage in 
a “connect” action defined by the Connect(c, s) predicate, or 
some client c and some server s engage in a “disconnect” 
action defined by the Disconnect(c, s) predicate. 


Given two states, $s$ and $s'$, we use the notation $s \rightarrow s'$
to denote that there exists a transition from $s$ to $s'$, i.e., that
the pair $(s, s')$ satisfies the transition relation predicate Next.
A behavior is an infinite sequence of states $s_0, s_1, \ldots$, such
that $s_0 \models Init$ and $s_i \rightarrow s_{i+1}$ (i.e., $(s_i, s_{i+1}) \models Next$) for
all $i \geq 0$. A state $s$ is reachable if there exists a behavior
$s_0, s_1, \ldots$, such that $s = s_i$ for some $i$. We use $Reach(M)$ to
denote the reachable states of a transition system $M$.


The entire set of behaviors of the system is defined as a
single temporal logic formula Spec (Line 17). In TLA+, Spec
is typically defined as the TLA+ formula $Init \wedge \Box[Next]_{vars}$,
where $\Box$ is the “always” operator of linear temporal logic, and
$[Next]_{vars}$ represents a transition which either satisfies Next
or is a stuttering step, i.e., where all state variables in $vars$
remain unchanged.


3) Invariants: In this paper we are interested in the verifi- 
cation of safety properties, and in particular invariants, which 
are state predicates that hold at all reachable states. Formally, 
a state predicate $P$ is an invariant if $s \models P$ holds for every
reachable state $s$. The model of Figure 1 contains one such
candidate invariant, specified by the predicate Safe (Line 18).
Safe states that there cannot be two different clients $c_i$ and $c_j$
which both hold locks to the same server.
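
For a small finite instance, the definitions of reachability and invariance can be checked by explicit enumeration. The sketch below hand-translates the lock server of Figure 1 into Python for two servers and two clients; the state representation (a frozenset of servers whose `locked` entry is TRUE, plus one frozenset of held servers per client) is an assumption of this sketch, not something a TLA+ tool produces.

```python
from itertools import product

SERVERS = ("a1", "a2")
CLIENTS = ("c1", "c2")


def init_state():
    locked = frozenset(SERVERS)                  # locked[s] = TRUE for all s
    held = tuple(frozenset() for _ in CLIENTS)   # held[c] = {} for all c
    return (locked, held)


def successors(state):
    """All states reachable in one Next step (Connect or Disconnect)."""
    locked, held = state
    for ci, s in product(range(len(CLIENTS)), SERVERS):
        if s in locked:                          # Connect(c, s)
            new_held = held[:ci] + (held[ci] | {s},) + held[ci + 1:]
            yield (locked - {s}, new_held)
        if s in held[ci]:                        # Disconnect(c, s)
            new_held = held[:ci] + (held[ci] - {s},) + held[ci + 1:]
            yield (locked | {s}, new_held)


def reachable():
    """Explicit-state search for Reach(M) from the initial state."""
    seen = {init_state()}
    frontier = [init_state()]
    while frontier:
        state = frontier.pop()
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


def safe(state):
    """Safe: two different clients never hold a lock on the same server."""
    _, held = state
    return all(i == j or not (held[i] & held[j])
               for i in range(len(held)) for j in range(len(held)))
```

Each server ends up either free or held by exactly one of the two clients, so this instance has nine reachable states, and `safe` holds in all of them.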




 1  CONSTANT Server, Client
 2  VARIABLE locked, held

 3  Init ≜
 4    ∧ locked = [i ∈ Server ↦ TRUE]
 5    ∧ held = [i ∈ Client ↦ {}]

 6  Connect(c, s) ≜
 7    ∧ locked[s] = TRUE
 8    ∧ held′ = [held EXCEPT ![c] = held[c] ∪ {s}]
 9    ∧ locked′ = [locked EXCEPT ![s] = FALSE]

10  Disconnect(c, s) ≜
11    ∧ s ∈ held[c]
12    ∧ held′ = [held EXCEPT ![c] = held[c] \ {s}]
13    ∧ locked′ = [locked EXCEPT ![s] = TRUE]

14  Next ≜
15    ∨ ∃ c ∈ Client, s ∈ Server : Connect(c, s)
16    ∨ ∃ c ∈ Client, s ∈ Server : Disconnect(c, s)

17  Spec ≜ Init ∧ □[Next]_⟨locked, held⟩

18  Safe ≜
19    ∀ ci, cj ∈ Client :
20      (held[ci] ∩ held[cj] ≠ {}) ⇒ (ci = cj)


Fig. 1. A simple parameterized protocol defined in TLA+.


4) Verification: The verification problem consists in checking
that a system satisfies its specification. In TLA+, both
the system and the specification are written as temporal logic
formulas. Therefore, expressed in TLA+, the safety verification
problem we consider in this paper consists of checking that
the temporal logic formula

$$Spec \Rightarrow \Box Safe \qquad (1)$$

is valid (i.e., true under all assignments). That is, establishing
that Safe is an invariant of the system defined by Spec.

5) Finite State Instances: Instantiating a sort means fixing 
it to a finite domain of distinct elements. For example, we 
can instantiate Server to be the set $\{a_1, a_2\}$ (meaning there
are only two servers, denoted $a_1$ and $a_2$), and Client to be
the set $\{c_1, c_2\}$ (meaning there are only two clients, denoted
$c_1$ and $c_2$). For the parameterized symbolic transition systems
considered in this paper, when we instantiate all sorts of an 
STS, the system becomes finite-state, i.e., the set of all possible 
system states is finite. 

6) Inductive Invariants: A standard technique for solving 
the safety verification problem (1) is to come up with an 
inductive invariant [36]. That is, a state predicate Ind which 
satisfies the following conditions: 


$$Init \Rightarrow Ind \qquad (2)$$
$$Ind \wedge Next \Rightarrow Ind' \qquad (3)$$
$$Ind \Rightarrow Safe \qquad (4)$$


where $Ind'$ denotes the predicate Ind where state variables are
replaced by their primed, next-state versions. Conditions (2)
and (3) are, respectively, referred to as initiation and consecution.
Condition (2) states that Ind holds at all initial states.


$A_1 \triangleq \forall s \in Server : \forall c \in Client : locked[s] \Rightarrow (s \notin held[c])$
$Ind \triangleq Safe \wedge A_1$


Fig. 2. A lemma invariant, $A_1$, and an inductive invariant, Ind, for the
protocol and safety property given in Figure 1.


Condition (3) states that Ind is inductive, i.e., if it holds at 
some state s then it also holds at any successor of s. Together 
these two conditions imply that Ind is also an invariant, i.e., 
that it holds at all reachable states. Condition (4) states that 
Ind is stronger than the invariant Safe that we are trying 
to prove. Therefore, if all reachable states satisfy Ind, they 
also satisfy Safe, which establishes (1). The difficulty is in 
coming up with an inductive invariant which satisfies the above 
conditions. The problem we consider in this paper is to infer 
such an inductive invariant automatically. 
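
Conditions (2)-(4) can be checked by brute force on a small finite instance. The following sketch re-encodes the lock server of Figure 1 for two servers and two clients and tests the three conditions over all 64 states of that instance (not only the reachable ones). The helper names and state representation are assumptions of this sketch, and a check on one finite instance says nothing about larger instances.

```python
from itertools import chain, combinations, product

SERVERS = ("a1", "a2")
NUM_CLIENTS = 2


def subsets(xs):
    return [frozenset(c) for c in chain.from_iterable(
        combinations(xs, r) for r in range(len(xs) + 1))]


def all_states():
    """Every assignment to (locked, held), reachable or not."""
    for locked in subsets(SERVERS):
        for held in product(subsets(SERVERS), repeat=NUM_CLIENTS):
            yield (locked, held)


def is_init(state):
    locked, held = state
    return locked == frozenset(SERVERS) and all(not h for h in held)


def successors(state):
    locked, held = state
    for ci, s in product(range(NUM_CLIENTS), SERVERS):
        if s in locked:                          # Connect(c, s)
            yield (locked - {s}, held[:ci] + (held[ci] | {s},) + held[ci + 1:])
        if s in held[ci]:                        # Disconnect(c, s)
            yield (locked | {s}, held[:ci] + (held[ci] - {s},) + held[ci + 1:])


def safe(state):
    _, held = state
    return all(i == j or not (held[i] & held[j])
               for i in range(NUM_CLIENTS) for j in range(NUM_CLIENTS))


def a1(state):
    """Lemma A_1 of Figure 2: a locked server is held by no client."""
    locked, held = state
    return all(s not in h for s in locked for h in held)


def is_inductive_invariant(ind):
    states = list(all_states())
    initiation = all(ind(s) for s in states if is_init(s))             # (2)
    consecution = all(ind(t) for s in states if ind(s)
                      for t in successors(s))                          # (3)
    implies_safe = all(safe(s) for s in states if ind(s))              # (4)
    return initiation and consecution and implies_safe
```

On this instance, `safe` alone fails consecution (a state where a server is both marked locked and already held allows a second client to connect), whereas the conjunction of `safe` and `a1` passes all three checks.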

7) Lemma Invariants: An inductive invariant Ind typically
has the form $Ind \triangleq Safe \wedge A_1 \wedge \cdots \wedge A_n$, where the
conjuncts $A_1, \ldots, A_n$ are state predicates and we refer to them
as lemma invariants. Observe that each $A_i$ must itself be an
invariant. The reason is that Ind must be an invariant, i.e.,
must contain all reachable states, and since Ind is stronger
than (i.e., contained in) each $A_i$, each $A_i$ must itself contain all
reachable states. Furthermore, although all lemma invariants
must be invariants, they need not be individually inductive.
However, the conjunction of all lemma invariants together with
the safety property Safe must be inductive. Figure 2 provides
an example of an inductive invariant, Ind, for the protocol
and safety property given in Figure 1. Ind contains a single
lemma invariant, $A_1$.

8) Counterexamples to Induction: Given a state predicate
$P$ (which is typically a candidate inductive invariant), a
counterexample to induction (CTI) is a state $s$ such that: (1)
$s \models P$; and (2) $s$ can reach a state satisfying $\neg P$ in $k$
steps, i.e. there exist transitions $s \rightarrow s_1 \rightarrow s_2 \rightarrow \cdots \rightarrow s_k$
and $s_k \models \neg P$. That is, a CTI is a state $s$ which proves
that $P$ is not inductive, i.e., not “closed” under the transition
relation. We denote the set of all CTIs of predicate $P$ by
$CTIs(P)$. Note that for any inductive invariant Ind, the set
$CTIs(Ind)$ is empty. Given another state predicate $Q$ and a
state $s \in CTIs(P)$, we say that $Q$ eliminates $s$ if $s \not\models Q$, i.e.,
if $s \models \neg Q$.
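
The CTI definition can be made concrete on the lock server of Figure 1. The sketch below enumerates all states of the instance with two servers and two clients and collects one-step CTIs, i.e., it fixes $k = 1$ for simplicity; the general definition allows any $k$. The encoding and helper names are assumptions of this sketch.

```python
from itertools import chain, combinations, product

SERVERS = ("a1", "a2")
NUM_CLIENTS = 2


def subsets(xs):
    return [frozenset(c) for c in chain.from_iterable(
        combinations(xs, r) for r in range(len(xs) + 1))]


def all_states():
    for locked in subsets(SERVERS):
        for held in product(subsets(SERVERS), repeat=NUM_CLIENTS):
            yield (locked, held)


def successors(state):
    locked, held = state
    for ci, s in product(range(NUM_CLIENTS), SERVERS):
        if s in locked:                          # Connect(c, s)
            yield (locked - {s}, held[:ci] + (held[ci] | {s},) + held[ci + 1:])
        if s in held[ci]:                        # Disconnect(c, s)
            yield (locked | {s}, held[:ci] + (held[ci] - {s},) + held[ci + 1:])


def safe(state):
    _, held = state
    return all(i == j or not (held[i] & held[j])
               for i in range(NUM_CLIENTS) for j in range(NUM_CLIENTS))


def a1(state):
    locked, held = state
    return all(s not in h for s in locked for h in held)


def ctis(pred):
    """One-step CTIs: states satisfying pred with a successor violating it."""
    return {s for s in all_states()
            if pred(s) and any(not pred(t) for t in successors(s))}


def eliminates(q, state):
    """Q eliminates s if s does not satisfy Q."""
    return not q(state)
```

On this instance every one-step CTI of `safe` violates `a1`, which is why adding the single lemma $A_1$ of Figure 2 is enough to remove all of them.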


III. OUR APPROACH 


At a high level, our inductive invariant inference method 
consists of the following steps: 


1) Generate many candidate lemma invariants, and store 
them in a repository that we call Invs. 

2) Generate counterexamples to induction for a current 
candidate inductive invariant, Ind. If we cannot find any 
such CTIs, return Ind. 

3) Select lemma invariants from Invs so that all CTIs are
eliminated. If we cannot eliminate all CTIs, either give
up, or go to Step 1 and populate the repository with more
lemma invariants. Otherwise, add the selected lemma
invariants to Ind and repeat from Step 2.


[Block diagram: inputs Spec and Safe; components Invariant Generator, CTI Generator, and CTI Eliminator; the lemma repository Invs; and the output.]

Fig. 3. Components of our technique for inductive invariant inference.


Algorithm 1 Our inductive invariant inference algorithm.
 1: Inputs:
      M: Finite instance of a parameterized STS
      Safe: Candidate invariant
      Invs: Lemma invariant repository (typically empty initially)
      G: Grammar for invariant generation
 2: procedure INFERINDUCTIVEINVARIANT(M, Safe, G, Invs)
 3:   Ind ← Safe
 4:   X ← GenerateCTIs(M, Ind)
 5:   Invs ← GenerateLemmaInvariants(M, Invs, G)
 6:   while X ≠ ∅ do
 7:     if ∃A ∈ Invs : A eliminates at least one CTI in X then
 8:       pick A_max ∈ Invs that eliminates the most CTIs from X
 9:       Ind ← Ind ∧ A_max
10:       X ← X \ {s ∈ X : s ⊭ A_max}
11:     else
12:       either goto Line 5
13:       or return (Ind, “Fail: couldn’t eliminate all CTIs.”)
14:     end if
15:     X ← GenerateCTIs(M, Ind)
16:   end while
17:   return (Ind, “Success: managed to eliminate all CTIs.”)
18: end procedure


The conceptual approach is illustrated in Figure 3. Our
detailed algorithm is described in Section III-A. Section III-B
provides details on our lemma invariant generation procedure,
Section III-C provides details on CTI generation, and
Section III-D describes the selection of lemma invariants.


A. Inductive Invariant Inference Algorithm 


Our inductive invariant inference algorithm is given in
pseudocode in Algorithm 1. The algorithm takes as input:
(1) a finite instance of a symbolic transition system M, (2)
a candidate invariant (safety property) Safe, (3) a lemma
invariant repository Invs, and (4) a grammar G for generating
lemma invariant candidates. The use of the grammar is
discussed further in Section III-B. Invs may initially be empty,
or be pre-populated from previous runs of the algorithm. The
algorithm aims to discover an inductive invariant, Ind, of the
form $Ind \triangleq Safe \wedge A_1 \wedge \cdots \wedge A_n$.

The algorithm maintains a current inductive invariant candidate,
Ind, which it initializes to Safe, the safety property
that we are trying to prove (Line 3). It then generates a set
X of CTIs of Ind (Line 4). The algorithm may also initialize
the repository of lemma invariants, Invs, or add more lemma
invariants to Invs if it is initially non-empty (Line 5). The
procedures GenerateLemmaInvariants and GenerateCTIs
are described in more detail below, in Sections III-B and III-C,
respectively.

In its main loop, the algorithm tries to eliminate all currently 
known CTIs. As long as the set X of currently known CTIs 
is non-empty, the algorithm tries to find a lemma invariant 
in the Invs repository that eliminates the maximal number of
remaining CTIs possible. If such a lemma invariant exists, the 
algorithm adds it as a new conjunct to Ind (Line 9), removes 
from X the CTIs that were eliminated by the new conjunct 
(Line 10), and proceeds by attempting to generate more CTIs, 
since the updated Ind is not necessarily inductive (Line 15). 

If no lemma invariant exists in the current repository Invs 
that can eliminate any of the currently known CTIs (Line 11), 
then we may either (1) generate more lemma invariants in the 
repository, or (2) give up. The first choice is implemented by 
the goto statement in Line 12. The second choice represents a 
failure of the algorithm to find an inductive invariant (Line 13). 
However, in this case we still return Ind since, even though 
it is not inductive, it may contain several useful lemma 
invariants. These lemma invariants are useful in the sense that 
they might be part of an ultimate inductive invariant. 

If all known CTIs have been eliminated, the algorithm 
terminates successfully and returns Ind (Line 17). Successful 
termination of the algorithm indicates that the returned Ind is 
likely to be inductive. However, our method does not provide
a formal inductiveness guarantee. Ind might not be inductive
for a number of reasons. First, as we discuss further in
Section III-C, our CTI generation procedure is probabilistic in
nature, and therefore GenerateCTIs might miss some CTIs.
Second, even if the finite-state instance M explored by the 
algorithm has no remaining CTIs, there might still exist CTIs 
in other instances of the STS, for larger parameter values. 
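
The control flow described above can be sketched in Python as follows. This is only an illustrative sketch, not the endive implementation: the function name, the callback signatures, and the `max_lemma_rounds` bound (which stands in for the nondeterministic "either/or" choice of Lines 12-13) are assumptions made for this example. The sketch also keeps any CTIs that survive elimination and merges in freshly generated ones, which is one possible reading of Lines 10 and 15.

```python
from typing import Callable, Hashable, Iterable, List, Set, Tuple

State = Hashable
Predicate = Callable[[State], bool]


def infer_inductive_invariant(
    safe: Predicate,
    generate_ctis: Callable[[List[Predicate]], Set[State]],
    generate_lemmas: Callable[[], Iterable[Predicate]],
    max_lemma_rounds: int = 3,  # hypothetical bound on retries of Line 12
) -> Tuple[List[Predicate], bool]:
    """Return the conjuncts of Ind and whether all known CTIs were eliminated."""
    ind: List[Predicate] = [safe]            # Line 3: Ind starts as Safe
    ctis = set(generate_ctis(ind))           # Line 4
    invs = list(generate_lemmas())           # Line 5
    rounds = 1
    while ctis:                              # Line 6
        # Lines 7-8: find the lemma that eliminates the most known CTIs.
        best, best_elim = None, set()
        for lemma in invs:
            eliminated = {s for s in ctis if not lemma(s)}
            if len(eliminated) > len(best_elim):
                best, best_elim = lemma, eliminated
        if best is None:
            if rounds >= max_lemma_rounds:
                return ind, False            # Line 13: give up, keep partial Ind
            invs.extend(generate_lemmas())   # Line 12: back to Line 5
            rounds += 1
            continue
        ind.append(best)                     # Line 9
        ctis -= best_elim                    # Line 10
        ctis |= generate_ctis(ind)           # Line 15 (merged with survivors; an assumption)
    return ind, True                         # Line 17
```

As in the algorithm, a `True` flag only means that no known CTIs remain; it does not establish inductiveness.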

Even though a candidate invariant returned by a successful
termination of Algorithm 1 is not formally guaranteed to be
inductive, we ensure soundness of our overall procedure by
doing a final check that the discovered candidate inductive invariant
is correct using the TLA+ proof system (TLAPS) [16].
Validation of invariants in TLAPS is discussed further in
Section III-E. In practice, we found that all of the invariants
generated in our evaluation (Section IV) are correct inductive
invariants.

We also remark that in the current version of our algorithm 
and in the current implementation of our tool, we only explore 
the single finite-state instance of the STS provided by the user, 
and we do not attempt to automatically increase the bounds of 
the parameters within the algorithm, as is done for example in 
the approach described in [24]. This is, however, a relatively
straightforward extension to our algorithm, and we would like to
explore this option in future work.


B. Lemma Invariant Generation 


For a given finite instance M of a parameterized transition 
system, the goal of lemma invariant generation is to produce 
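
$\langle seed \rangle ::= locked[s] \mid s \in held[c] \mid held[c] = \emptyset$
$\langle quant \rangle ::= \forall s \in Server : \forall c \in Client$
$\langle expr \rangle ::= \langle seed \rangle \mid \neg \langle expr \rangle \mid \langle expr \rangle \lor \langle expr \rangle$
$\langle pred \rangle ::= \langle quant \rangle : \langle expr \rangle$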


Fig. 4. Example of a grammar for lemma invariant generation for the 
lockserver protocol shown in Figure 1. The list of unquantified seed predicates 
and the quantifier template, quant, are provided as user inputs. 


a set of state predicates that are invariants of M. To search for 
these invariants, we adopt an approach similar to other, syntax- 
guided synthesis based techniques [21], [20] for invariant 
discovery. We randomly sample invariant candidates from a 
defined grammar, which is generated from a given set of 
seed predicates. Each seed predicate is an atomic boolean 
predicate over the state variables of the system. Note that the 
parameterized distributed protocols that we consider in this 
paper typically have inductive invariants that are universally 
or existentially quantified over the parameters of the protocol 
or other values of the system state. So, our invariant generation 
technique assumes a fixed quantifier template that is provided 
as input. The provided seed predicates are unquantified predi- 
cates that can contain bound variables that appear in the given 
quantifier template. An example of a simple grammar for the 
protocol of Figure 1 is shown in Figure 4. 

Candidate invariants are produced by generating random 
predicates over the space of seed predicates. Specifically, a 
candidate predicate is formed as a random disjunction of
seed predicates, where each disjunct may be negated with
probability 1/2. The logical connectives $\{\lor, \neg\}$ are functionally
complete [48], so they serve as a simple basis for generating
candidate invariants, which we chose to reduce the invariant 
search space. 
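
As a minimal sketch of this sampling step, the following Python fragment draws a random disjunction over the three seed predicates of the lockserver grammar and evaluates it under the quantifier template over Server and Client. The state encoding (a dictionary with `locked` and `held` entries), the seed names, and the functions `sample_candidate` and `holds` are hypothetical and chosen only for illustration; each disjunct is negated by a fair coin flip.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical state encoding for the lockserver example:
# {"locked": {server: bool}, "held": {client: set of servers}}
State = Dict[str, dict]
Literal = Tuple[str, bool]  # (seed name, negated?)

SEEDS: Dict[str, Callable[[State, str, str], bool]] = {
    "locked[s]": lambda st, s, c: st["locked"][s],
    "s in held[c]": lambda st, s, c: s in st["held"][c],
    "held[c] = {}": lambda st, s, c: len(st["held"][c]) == 0,
}


def sample_candidate(num_disjuncts: int, rng: random.Random) -> List[Literal]:
    """Pick distinct seeds and negate each one with probability 1/2."""
    names = rng.sample(sorted(SEEDS), num_disjuncts)
    return [(name, rng.random() < 0.5) for name in names]


def holds(candidate: Sequence[Literal], state: State,
          servers: Sequence[str], clients: Sequence[str]) -> bool:
    """Evaluate 'forall s, c: l_1 or ... or l_m' on a single state."""
    return all(
        any(SEEDS[name](state, s, c) != negated for name, negated in candidate)
        for s in servers
        for c in clients
    )
```

For instance, the candidate `[("locked[s]", True), ("s in held[c]", True)]` encodes the predicate stating that, for every server s and client c, either s is not locked or s is not in held[c].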

For a given set of candidate invariants, C, we check which
of the predicates in C are invariants using an explicit state
model checker. This can be done effectively due to our use
of the small scope hypothesis, i.e., the fact that we reason only
about a finite instance M of a parameterized transition system.
This largely reduces the invariant checking problem to a data
processing task. Namely:


(1) Generate Reach(M), the set of reachable states of M. 
(2) Check that $s \models P$ for each predicate $P \in C$ and each
$s \in Reach(M)$.


Note that after (1) has been completed once, the set of 
reachable states can be cached and only step (2) must be re- 
executed when searching for additional invariants. 

In theory, the worst case cost of step (2) is proportional to
$|C| \cdot |Reach(M)|$. In practice, however, it can often be much
less costly than this, since once a state violates a predicate 
P, P need not be checked further. Furthermore, both of the 
above computation steps are highly parallelizable, a fact we 
make use of in our implementation, as discussed further in 
Section IV-A. 
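
The two steps, together with the early-exit behavior that keeps the practical cost below the worst case, can be sketched as follows. This is an illustrative, single-threaded sketch rather than the TLC-based implementation; the function names and the representation of candidates as named Python predicates are assumptions of this example.

```python
from collections import deque
from typing import Callable, Dict, Hashable, Iterable, List, Set

State = Hashable


def reachable_states(init: Iterable[State],
                     successors: Callable[[State], Iterable[State]]) -> Set[State]:
    """Step (1): breadth-first exploration of Reach(M); the result can be cached."""
    seen = set(init)
    queue = deque(seen)
    while queue:
        s = queue.popleft()
        for t in successors(s):
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen


def filter_invariants(candidates: Dict[str, Callable[[State], bool]],
                      reach: Iterable[State]) -> List[str]:
    """Step (2): keep the candidates that hold in every reachable state."""
    alive = dict(candidates)
    for s in reach:
        # A candidate violated once is dropped and never checked again.
        for name in [n for n, pred in alive.items() if not pred(s)]:
            del alive[name]
        if not alive:
            break
    return list(alive)
```

Because `reachable_states` is independent of the candidates, its result can be reused across calls to `filter_invariants` when searching for additional invariants.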


We also remark that, in practice, the
GenerateLemmaInvariants procedure is configured to
search for candidate invariants of a fixed term size, i.e., with a
fixed or maximal number of disjuncts. In our implementation,
presented in Section IV-A, we utilize this to search for 
smaller invariants (fewer terms) first, before searching for 
larger ones. That is, we prefer to eliminate CTIs if possible 
with smaller invariants before searching for larger ones. This 
aims to bias our procedure towards discovery of compact 
inductive invariant lemmas. 

Furthermore, since GenerateLemmaInvariants does not
employ an exhaustive search for invariants over a given space
of predicates, it accepts a numeric parameter, $N_{lemmas}$, which
determines how many candidate predicates to sample. More
details of how the concrete values of this parameter are 
configured are discussed in our evaluation, in Section IV. 


C. CTI Generation 


Each round of our algorithm relies on access to a set of 
multiple CTIs, as a means to prioritize between different 
choices of new lemma invariants. To generate these CTIs, we 
use a probabilistic technique proposed in [34] that utilizes 
the TLC explicit state model checker [52]. Given a finite
instance of an STS M with system states $S$, transition relation
predicate Next, and given candidate inductive invariant
Ind, the procedure GenerateCTIs(M, Ind) works by calling
the TLC model checker. TLC attempts to randomly sample
states $s_0 \in S$ for which there exists a sequence of states
$s_1, s_2, \ldots, s_{k-1}, s_k \in S$, such that both of the following hold:

• $\forall i = 0, 1, \ldots, k-1 : (s_i, s_{i+1}) \models Next \land s_i \models Ind$

• $s_k \not\models Ind$.

The model checker will report this behavior, and all states
$s_0, s_1, s_2, \ldots, s_{k-1}$ are recorded as counterexamples to induction.

Due to the randomized nature of this technique, the CTI 
generation procedure requires a given parameter, $N_{ctis}$, that
effectively determines how many possible states TLC will 
attempt to sample before terminating the CTI generation 
procedure. This is required, since, for systems with sufficiently 
large state spaces, even if finite, sampling all possible states is 
infeasible. Generally, this parameter can be tuned based on the 
amount of compute power available to the tool, or a latency 
tolerance of the user. We discuss more details of this parameter 
and how it is tuned in our experiments in Section IV. 
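
The sampling idea can be illustrated with a small random-walk sketch in Python. It is not how TLC is invoked in practice: the explicit list of states, the `max_depth` bound, and the interpretation of the sampling budget as a number of random walks (`n_walks`) are simplifying assumptions of this example.

```python
import random
from typing import Callable, Hashable, Iterable, Sequence, Set

State = Hashable


def generate_ctis(states: Sequence[State],
                  successors: Callable[[State], Iterable[State]],
                  ind: Callable[[State], bool],
                  n_walks: int,
                  max_depth: int,
                  rng: random.Random) -> Set[State]:
    """Sample start states satisfying Ind and walk until Ind is violated."""
    ctis: Set[State] = set()
    starts = [s for s in states if ind(s)]
    if not starts:
        return ctis
    for _ in range(n_walks):
        path = [rng.choice(starts)]          # s_0, which satisfies Ind
        for _ in range(max_depth):
            succs = list(successors(path[-1]))
            if not succs:
                break
            nxt = rng.choice(succs)
            if not ind(nxt):
                # s_k violates Ind: record s_0, ..., s_{k-1}.
                ctis.update(path)
                break
            path.append(nxt)
    return ctis
```

A larger budget explores more start states and therefore tends to find a more diverse set of CTIs, at the cost of longer running time.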

In practice, during our evaluation we found that TLC was 
able to effectively generate many thousands of CTIs at each 
round of the inference algorithm using the above technique. 
This provided an adequately diverse distribution of CTIs for 
effectively guiding our counterexample elimination procedure, 
which we describe in more detail in Section III-D. Section IV
presents more detailed metrics on CTI generation as measured
when testing our implementation on a variety of protocol
benchmarks. In the future, it would be valuable to explore
and compare with other, SMT/SAT based techniques for this
type of counterexample generation task [18], [30].


D. Lemma Invariant Selection by CTI Elimination


The task of selecting lemma invariants for use as inductive 
invariant conjuncts is based on a process of CTI elimination, 
as described briefly in Section III-A. That is, CTIs are used 
as guidance for which invariants to choose for new lemma in- 
variants to append to the current inductive invariant candidate. 
Once a sufficiently large set of CTIs has been generated, as 
discussed in Section III-C, we select lemma invariants using a
greedy heuristic of CTI elimination, which we describe below. 

1) CTI Elimination: Recall that a CTI s is eliminated by
a state predicate A if $s \not\models A$. When examining a current set
of CTIs, X, our algorithm looks for the next lemma invariant
$A \in Invs$ that eliminates the most CTIs in X. The algorithm
will continue choosing additional lemma invariants according
to this strategy until all counterexamples are eliminated, or
until it cannot eliminate any further counterexamples. Each
selected invariant $A_i \in Invs$ will be appended as a new
conjunct to the current inductive invariant candidate, i.e.,
$Ind \leftarrow Ind \land A_i$. Once all counterexamples have been eliminated,
the tool will terminate and return a final candidate inductive 
invariant. This is a simple heuristic for choosing new invariant 
conjuncts that aims to bias the overall inductive invariant 
towards being relatively concise. That is, if we have a choice 
between two alternate lemma conjuncts to choose from, we 
prefer the conjunct that eliminates more CTIs. 

More generally, lemma selection at each round of the 
algorithm can be viewed as a version of the set covering 
problem [15]. Ideally, we would like to find the smallest 
set of lemma invariants that eliminate (i.e. cover) the set of 
CTIs X. Solving this problem optimally is known to be NP- 
complete [28], but we have found a greedy heuristic [14] to 
work sufficiently well in our experiments, the results of which 
are presented in Section IV. In the future, we would like to explore
more sophisticated heuristics for lemma selection that take
into account additional metrics, such as syntactic invariant size
and quantifier depth.
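
Viewed as set covering, the greedy heuristic repeatedly picks the lemma whose elimination set covers the most remaining CTIs. A minimal Python sketch follows; the function name `select_lemmas` and the representation of lemma invariants as named predicates are assumptions made for illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

State = Hashable


def select_lemmas(ctis: Iterable[State],
                  invs: Dict[str, Callable[[State], bool]]
                  ) -> Tuple[List[str], Set[State]]:
    """Greedy set cover: return chosen lemma names and any CTIs left uncovered."""
    remaining = set(ctis)
    # A lemma "covers" (eliminates) exactly the CTIs that violate it.
    eliminates = {name: {s for s in remaining if not pred(s)}
                  for name, pred in invs.items()}
    chosen: List[str] = []
    while remaining:
        best = max(eliminates,
                   key=lambda n: len(eliminates[n] & remaining),
                   default=None)
        if best is None or not (eliminates[best] & remaining):
            break  # no lemma eliminates any remaining CTI
        chosen.append(best)
        remaining -= eliminates[best]
    return chosen, remaining
```

A non-empty second component of the result corresponds to the case where the current repository cannot eliminate all known CTIs, so more lemma invariants would need to be generated.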


E. Validation of Inductive Invariant Candidates 


If our inference algorithm terminates successfully, it will 
return a candidate inductive invariant. Since we look for 
invariants on finite protocol instances, though, this candidate 
may not be an inductive invariant for general (e.g. unbounded) 
protocol instances. So, upon termination, we check to see if the 
returned candidate invariant is truly inductive for all protocol 
instances by passing it to an SMT solver. Currently, we use
the TLA+ proof system (TLAPS) [16] for this step, which
generates an SMT encoding for TLA+ [37].

For many of the protocols we tested and the invariants 
discovered by our tool, we found that this step was fully 
automated (see Section IV and Table III in the Appendix). 
That is, no user assistance was required to establish validity 
of the discovered invariant. In cases where the underlying 
solver cannot automatically prove the candidate inductive 
invariant, some amount of human guidance can be provided 
by decomposing the proof into smaller SMT queries. We have
completed this validation step for all of the inductive invariant
candidates discovered in our experiments, and we confirmed
that all candidate invariants produced by our tool were indeed
correct inductive invariants (see Section IV).


IV. IMPLEMENTATION AND EVALUATION 


A. Implementation and Experimental Setup 


Our invariant inference algorithm is implemented in a tool, 
endive, whose main implementation consists of approximately 
2200 lines of Python code. There are also some optimized 
subroutines which consist of an additional few hundred lines 
of C++ code. Internally, endive makes use of version 2.15 of 
the TLC model checker [52], with some minor modifications 
to improve the efficiency of checking many invariants simultaneously.
TLC is used by endive for most of the algorithm’s
compute intensive verification tasks, like checking candidate
lemma invariants (Section III-B) and CTI elimination checking
(Section III-D1).

For all of the experiments discussed below, endive is con- 
figured to use 24 parallel TLC worker threads for invariant 
checking, 4 parallel threads for CTI generation, and 4 threads 
for CTI elimination. CTI generation and CTI elimination can 
be parallelized further in a straightforward manner, but we 
limit these procedures to 4 parallel threads to simplify certain 
aspects of our current implementation. 

For each benchmark run, we initialize Invs (as explained 
in Algorithm 1) as an empty set and configure the lemma 
invariant generation procedure discussed in Section III-B with
a parameter value of $N_{lemmas} = 15000$. The grammars used
for invariant generation were mined from predicates appearing 
in each protocol specification. 

We configure our CTI generation procedure with a parameter
value of $N_{ctis} = 50000$. $N_{ctis}$ does not directly correspond
to how many concrete CTI states will be generated, but a 
higher value indicates TLC will sample more states when 
searching for CTIs. We also limit the maximum number of 
CTIs returned by each call to the GenerateCTIs procedure to 
10000 states. In theory, generating more CTIs provides better 
counterexample diversity, and is therefore better for our CTI 
elimination heuristics. We impose an upper limit, however, to 
avoid scalability issues in our tool’s current implementation. 
In practice we found this limit sufficient to provide effective 
guidance for lemma invariant selection. 

All of our experiments were run on a 48-core Intel(R) 
Xeon(R) Gold 5118 CPU @ 2.30GHz machine with 196GB 
of RAM. 


B. Benchmarks 


To evaluate endive, we measured its performance on 29 
protocols selected from an existing benchmark set published 
in [24]. We also evaluate endive on an additional, industrial 
scale protocol, MongoLoglessDynamicRaft (MLDR), which is 
a recent protocol for distributed dynamic reconfiguration in a 
Raft based replication system [45], [44]. 

1) Protocol Conversion: The 29 benchmarks we used from
[24] were originally specified in Ivy [42], but endive accepts
protocols in TLA+, so it was necessary to manually translate
the protocols from Ivy to TLA+. There are significant
differences in how protocols are specified in Ivy and TLA+.
The underlying approach to modeling systems as discrete
transition systems, however, by specifying initial states and a
transition relation, is common to both. In our manual
translation, we aimed to emulate the original Ivy model as
closely as possible.

The formal specification for the MongoLoglessDynamicRaft
protocol (MLDR) was originally written in TLA+ [45]. Thus,
in order to compare with other invariant inference tools which
accept Ivy as their input language, we had to translate MLDR
from TLA+ into Ivy. This conversion process was highly
nontrivial due to the significant differences between the Ivy
and TLA+ languages. TLA+ is a very expressive language
that includes integers, strings, sets, functions, records, and 
sequences as primitive data types along with their standard 
semantics. In contrast, the Ivy modeling language, RML 
[42], includes only basic, first order relations and functions. 
For more complex datatypes (e.g. arrays or sequences), their 
semantics must be defined and axiomatized manually. 

An artifact containing all of our source code and instructions 
for reproducing our evaluation results can be found at [43]. A 
public, open-source version of our tool is also available at [1]. 


C. Results 


Our overall results are shown in Table I. We compared 
endive with four recent, state of the art techniques for in- 
ferring invariants of distributed protocols: IC3PO [24], fol-ic3 
[29], SWISS [25], and DistAI [51]. Note that endive accepts 
protocols in TLA+, whereas all other tools accept protocols in
Ivy or mypyvy. 

The numbers shown for both IC3PO and fol-ic3 in Table I 
are as reported in the evaluation presented in [24], with 
timeouts indicated by a TO entry. For the SWISS results in 
Table I, where possible, we show the runtime numbers reported 
in [25], indicated with a † mark. For the benchmarks in Table I
that were not tested in [25], we present the results from our 
own runs of the tool, all using default SWISS configuration 
parameters. We ran SWISS both with an invariant template 
matching our own template for endive and also in automatic 
mode, and report the better of the two results. The results for 
DistAI are reported from our runs using the tool in its default 
configuration. For DistAI and SWISS, we report an err result 
in cases where the tool returned an error without producing a 
result. We report a fail result in cases where DistAI or SWISS 
terminated without error but did not discover an inductive 
invariant. In all cases where a benchmark protocol was not 
available in the required input language for the corresponding 
tool, we mark this with an n/a entry. 

For each benchmark result in Table I, we report the total 
wall clock time to discover an inductive invariant in the Time 
column, along with the number of total lemma invariants 
contained in the discovered invariant, including the safety 


No. | Protocol | endive (Time, Inv) | IC3PO (Time, Inv) | fol-ic3 (Time, Inv) | SWISS (Time, Inv) | DistAI (Time, Inv)

1 |tla-consensus 1 1 0 1 1 1 1 2 2- ii 

2 |tla-tcommit 2 1 ENE? 253 2 8 2 7 

3 |i4-lock-server T2 1-2 Bs? t1 2| en 

4 |ex-quorum-leader-election i: 22. 3 5 24 8 11. ,25 3% 

5 |pyv-toy-consensus-forall 9 3 3 5 11 5 t3 7| er 

6 |tla-simple & 2 6 3} TO 28 8| err 

7 [ex-lockserv-automaton 23 9 7 12) 10 12| fail 2 13 

8 |tla-simpleregular 0 4 8 4) 57 9 65 21| err 

9 |pyv-sharded-kv 312 6] 10 8| 22 10|*4024 2 16 
0 |pyv-lockserv 35 9) 11 12 8 11|13684 2 13 
1 |tla-twophase 43 10] 14 9 9 12 33 24) 29 306 
2 |i4-learning-switch TO 14 10} TO TO 21 32 
3 Jex-simple-decentralized-lock 44 4) 19 15 4 8 1 26 17 
4 |i4-two-phase-commit 69 11| 27 11 8 9 6 15) 17 67 
5 |pyv-consensus-wo-decide 127 8| 50 9| 168 26| t18 8| err 
6 |pyv-consensus-forall 175 8| 99 10/2461 27| 129 9| err 
7 |pyv-learning-switch TO 127 13| TO t959 79 70 
8 |i4-chord-ring-maintenance n/a 229 12| TO İTO 53 164 
9 |pyv-sharded-kv-no-lost-keys 32 3 2 3 2 1 4j fail 

20 |ex-naive-consensus 40 4 6 4 73 18 18 5| fail 

21 |pyv-client-server-ae 46 2 2 2V 877 15 3.°°5|. ert 

22 |ex-simple-election 24 4 7 4; 32 10 9 5| er 

23 |pyv-toy-consensus-epr 9 4 9 4) 70 14 2 4| er 

24 |ex-toy-consensus 7 2) 10 3) 21 8 6 4| err 

25 |pyv-client-server-db-ae 4941 8| 17 6| TO t24 13| err 

26 |pyv-hybrid-reliable-broadcast| n/a 587 4|1360 23| İTO err 

27 |pyv-firewall 38 5 23 7 8 75 5| erm 

28 |ex-majorityset-leader-election| 53 4| 72 7| TO 28 10| err 

29 |pyv-consensus-epr 247 8|1300 9|1468 30 72 10| err 

30 |mldr 2025 6| TO n/a err err 

TABLE I
DISTRIBUTED PROTOCOL BENCHMARK RESULTS.


property, in the Inv column. Note that the number of total lemmas
in the invariants discovered by SWISS was not reported
in [25]. Thus, we report the number of lemmas discovered by 
SWISS in our own runs, for the cases where we were able to 
run SWISS successfully to produce an invariant. 

More detailed statistics on the endive benchmark results 
are provided in Appendix A, specifically: the number of 
eliminated CTIs, runtime profiling information, finite instance 
sizes used, and automation level of the TLAPS proofs. 


D. Comparison with Other Tools 


Although Table I relates our approach to several others, we 
note that our tool is not directly comparable to other tools. 
The most fundamental difference is that our tool accepts TLA+
whereas all other tools in Table I accept Ivy or mypyvy. Furthermore,
some tools work only with the restricted decidable
EPR or extended EPR fragments of Ivy. To our knowledge,
this is the case with SWISS and DistAI. As a result, our
tool is a priori less automated than other tools, following
a standard tradeoff between expressivity and automation. In
practice, however, and despite this theoretical limitation, our
tool produces a result in most cases, while some of the a priori
more automated tools time out or fail.

Another important difference between the tools of Table I is 
what kind of inductive invariants can be produced by each tool. 
In our case, the user provides the grammar of possible lemma 
invariants as an input to the tool, allowing both universal and 
existentially quantified invariants ($\forall$ and $\exists$). DistAI is limited
to only universally quantified ($\forall$) invariants, and SWISS is
limited to invariants that fall into the extended EPR fragment,
though it can learn both universal and existentially quantified 
invariants. Both fol-ic3 and IC3PO attempt to learn the quanti- 
fier structure itself during counterexample generalization, and 
can infer both universal and existentially quantified invariants. 
These tools do not always guarantee, however, that the discov- 
ered invariants will fall into a decidable logic fragment. Thus, 
they provide no explicit guarantee that the overall inference 
procedure will, in general, be fully automated. 


E. Discussion 


Our tool, endive, was able to successfully discover an 
inductive invariant for 25 of the 29 protocol benchmarks from 
[24], and all of the invariants it discovered were proven correct 
using TLAPS. For the two protocols out of these 29 on which our
tool timed out, pyv-learning-switch and i4-learning-switch,
this was due to scalability limitations of CTI generation, which 
we believe could be improved with a smarter CTI generation 
algorithm or by incorporating a symbolic model checker [30] 
for this task. 

endive was also able to automatically discover an inductive 
invariant for a key safety property of MLDR, a Raft-based dis- 
tributed dynamic reconfiguration protocol [45]. This protocol, 
reported in Table I as mldr, is a significantly more complex, 
industrial scale protocol [44]. IC3PO was not able to discover 
an invariant for our Ivy model of the MLDR protocol after 
a 1 hour timeout when given the same instance size used in 
the TLA+ model given to endive. SWISS and DistAI both
produced an error when run on our Ivy model of MLDR. 

Generally, the wall clock time taken for endive to discover 
an inductive invariant is of a similar order of magnitude 
to IC3PO. endive even outperforms IC3PO in some cases, 
despite the fact that endive works with TLA+ and IC3PO
works with Ivy. Moreover, in several cases where endive’s 
runtime exceeds that of IC3PO, endive is able to discover 
a smaller inductive invariant (e.g. pyv-lockserv, ex-simple- 
decentralized-lock, pyv-consensus-forall). Additionally, endive 
is often able to discover a considerably smaller invariant 
than tools like DistAI and SWISS. For example, on tla- 
twophase, endive learns an invariant with 10 overall conjuncts, 
whereas SWISS learns a 24 conjunct invariant, and DistAI 
learns a much larger invariant, with over 300 conjuncts. 
endive performs similarly well for the tla-simpleregular and i4- 
two-phase-commit benchmarks. This demonstrates that endive 
compares favorably against other enumerative approaches for 
inductive invariant inference, both in terms of efficiency and 
compactness of invariants, while also working over TLA+, a
much more expressive input language. 

It is additionally worth noting that our current endive 
implementation is not highly optimized. In particular, the TLC 
model checker, used internally by endive, is implemented 
in Java and interprets TLA+ specifications dynamically [31],
rather than compiling models to a low level, native representa- 
tion as done by tools like SPIN [26]. As a result, TLC may not 
be the most efficient for our inference procedure, and could 
likely be optimized further. 


V. RELATED WORK 


There are several recently published techniques that attempt 
to solve the problem of inductive invariant inference for 
distributed protocols. The IC3PO tool [24], which extended the 
earlier I4 tool [35], uses a technique based on IC3 [10] with 
a novel symmetry boosting technique that serves to accelerate 
IC3/PDR and also to infer the quantifier structure of lemma
invariants. The fol-ic3 algorithm [29] is another IC3 based
algorithm, which uses a novel separators technique for
discovering quantified formulas to separate positive
and negative examples during invariant inference. SWISS
[25] is another recent approach that uses an enumerative search 
for quantified invariants while using the Ivy tool to validate 
possible inductive candidates. It relies on SMT based reason- 
ing over an unbounded domain, and does not reason directly 
about finite instances of distributed protocols. DistAI [51] uses 
a similar approach but additionally utilizes a technique of 
sampling reachable protocol states to filter invariants, which 
is similar to our approach of executing explicit state model 
checking as a means to quickly discover invariants. DistAI 
is limited, however, to learning only universally quantified 
invariants. 

In addition to these inductive invariant inference techniques, 
there also exists prior work on alternative techniques for 
parameterized protocol verification. These include approaches 
based on cutoff detection [3], regular model checking [9], and 
symbolic backward reachability analysis [23]. 

More broadly, there exist many prior techniques for the 
automatic generation of program and protocol invariants that 
rely on data driven or grammar based approaches. Houdini 
[22] and Daikon [19] both use enumerative checking ap- 
proaches to discover program invariants. FreqHorn [20] tries 
to discover quantified program invariants about arrays using an 
enumerative approach that discovers invariants in stages and 
also makes use of the program syntax. Other techniques have 
also tried to make invariant discovery more efficient by using 
improved search strategies based on MCMC sampling [46]. 


VI. CONCLUSIONS AND FUTURE WORK 


We presented a new technique for inferring inductive invariants
for distributed protocols specified in TLA+ and evaluated
it on a diverse set of protocol benchmarks. Our approach is
novel in that: (1) it is the first, to our knowledge, to infer
inductive invariants directly for protocols specified in TLA+
and (2) it is based around a core procedure for generating 
plain, not necessarily inductive, lemma invariants. Our results 
show that our approach performs strongly on a diverse set 
of distributed protocol benchmarks. In addition, it is able to 
discover an inductive invariant for an industrial scale dynamic 
reconfiguration protocol. 

In the future, our tool can be extended to allow for automatic
quantifier template search, and further optimizations can be
made to the lemma invariant generation and selection procedures.
It would be interesting to explore ways in which the
invariant generation procedure can be guided more directly
by the generated counterexamples to induction, as a means to
prune the search space of candidate invariants more efficiently,
perhaps using techniques similar to those presented in [46].
We would also be interested to see if quantifier structures
can be inferred from the protocol syntax itself. Improving
the performance of TLC, or experimenting with other, more
efficient model checkers [26] would be another avenue, since
model checking performance is a main bottleneck of our
current approach.


APPENDIX A
DETAILED BENCHMARK RESULTS


Table II gives a more detailed breakdown of the results 
presented in Table I for our endive invariant inference tool. 
The Check, Elim, and CTIGen columns of Table II indicate, 
respectively, the wall clock time in seconds for (1) checking 


candidate lemma invariants, (2) eliminating CTIs, and (3) 
generating CTIs. The CTIs column indicates the total number 
of eliminated CTIs. 

Recall that we limit the maximum number of generated 
CTIs to 10000 per round, as mentioned in Section IV-A. This 
explains why some protocol results for the endive tool report 
elimination of exactly 10000 CTIs. For example, for the tla- 
twophase benchmark, an inductive invariant was discovered 
in a single round of the algorithm loop (starting at Line 6 of 
Algorithm 1), so no more than 10000 CTIs were generated in 
the entire run. If the benchmark run eliminated more than
10000 CTIs, this indicates that it ran for more than one
round.

Also, for protocols that eliminated 0 CTIs (e.g. tla-consensus,
tla-tcommit), this indicates that the starting safety
property was already inductive. Thus, no CTIs were ever
generated and no lemma invariants were needed. Similarly,
some protocols eliminated a nonzero number of CTIs smaller
than 10000 (e.g. ex-quorum-leader-election). This may be the
case when no more than a single round of the algorithm was
needed to discover an inductive invariant, or when the number
of generated counterexamples at each round did not exceed
10000. Recall that, even within a single round of the algorithm,
as shown in Algorithm 1, it is possible to discover multiple
new lemma invariants.

Additional statistics on the instance sizes used during invariant
inference and the degree of automation required for
TLAPS proofs are shown in Table III. The TLAPS Auto column
indicates whether the TLAPS proof of the inductive invariant
discovered by endive was completely automatic (indicated
with a ✓), or required some user assistance (indicated with
a ✗).

To provide more fine-grained detail on the level of automation
for each TLAPS proof, the TLAPS Auto column also
includes the number of verification conditions in the induction
check that were proved fully automatically. For a protocol
with a transition relation of the form $Next = T_1 \lor \cdots \lor T_k$
and an inductive invariant candidate $Ind = A_1 \land \cdots \land A_n$,
the consecution check $Ind \land Next \Rightarrow Ind'$ is typically
the most significant verification burden, and can be trivially
decomposed into $k \cdot n$ verification conditions (VCs). That is,
a verification condition $Ind \land T_j \Rightarrow A_i'$ is generated for
each $j \in \{1, \ldots, k\}$ and $i \in \{1, \ldots, n\}$, giving $k \cdot n$ total
VCs. We notate these statistics in the TLAPS Auto column as
(# VCs proved automatically / $k \cdot n$ total VCs). Protocols that
were proved fully automatically are shown as ($k \cdot n$ / $k \cdot n$). The
Check (s) column also shows the total time in seconds needed
to check each proof, as measured on a 2020 M1 Macbook Air
using version 1.4.5 of the TLA+ proof manager.
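
The decomposition of the consecution check into individual verification conditions follows from distributing the implication over the disjunction in the antecedent and the conjunction in the consequent. Written out as a short derivation (a sketch using only the symbols defined above):

```latex
\begin{align*}
(Ind \land Next) \Rightarrow Ind'
  &\equiv \bigl(Ind \land (T_1 \lor \cdots \lor T_k)\bigr) \Rightarrow (A_1' \land \cdots \land A_n') \\
  &\equiv \bigwedge_{j=1}^{k} \bigl( (Ind \land T_j) \Rightarrow (A_1' \land \cdots \land A_n') \bigr) \\
  &\equiv \bigwedge_{j=1}^{k} \bigwedge_{i=1}^{n} \bigl( (Ind \land T_j) \Rightarrow A_i' \bigr)
\end{align*}
```

The final line contains exactly one conjunct per pair $(j, i)$, which accounts for the $k \cdot n$ verification conditions.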


TABLE II
DETAILED PROFILING RESULTS FOR THE endive RESULTS FROM TABLE I.


No. | Protocol | Time | CTIs | Check | Elim | CTIGen
1 | tla-consensus | 1 | 0 | 0 | 0 | 1
2 | tla-tcommit | 2 | 0 | 0 | 0 | 2
3 | i4-lock-server | 7 | 12 | 2 | 2 | 4
4 | ex-quorum-leader-election | 11 | 204 | 2 | 2 | 7
5 | pyv-toy-consensus-forall | 19 | 412 | 2 | 2 | 15
6 | tla-simple | 8 | 15 | 2 | 2 | 5
7 | ex-lockserv-automaton | 23 | 3624 | 6 | 8 | 9
8 | tla-simpleregular | 10 | 1972 | 3 | 3 | 5
9 | pyv-sharded-kv | 312 | 11715 | 17 | 46 | 249
10 | pyv-lockserv | 35 | 3654 | 11 | 11 | 13
11 | tla-twophase | 43 | 10000 | 10 | 22 | 12
12 | i4-learning-switch | TO
13 | ex-simple-decentralized-lock | 44 | 2035 | 13 | 18 | 14
14 | i4-two-phase-commit | 69 | 10408 | 18 | 19 | 33
15 | pyv-consensus-wo-decide | 127 | 12995 | 56 | 39 | 32
16 | pyv-consensus-forall | 175 | 10609 | 63 | 25 | 88
17 | pyv-learning-switch | TO
18 | i4-chord-ring-maintenance | n/a
19 | pyv-sharded-kv-no-lost-keys | 13 | 404 | 2 | 2 | 9
20 | ex-naive-consensus | 40 | 10000 | 10 | 15 | 16
21 | pyv-client-server-ae | 46 | 10000 | 2 | 4 | 40
22 | ex-simple-election | 24 | 551 | 10 | 7 | 8
23 | pyv-toy-consensus-epr | 19 | 384 | 8 | 6 | 6
24 | ex-toy-consensus | 7 | 14 | 2 | 2 | 4
25 | pyv-client-server-db-ae | 4941 | 12546 | 4657 | 46 | 239
26 | pyv-hybrid-reliable-broadcast | n/a
27 | pyv-firewall | 38 | 1740 | 11 | 22 | 7
28 | ex-majorityset-leader-election | 53 | 10000 | 12 | 15 | 26
29 | pyv-consensus-epr | 247 | 16269 | 80 | 38 | 129
30 | mldr | 2025 | 7751 | 1272 | 651 | 102

TABLE III
ADDITIONAL STATISTICS FOR endive RESULTS REPORTED IN TABLE I.

No. | Protocol                       | Instance Size | TLAPS Auto | Check (s)
1   | tla-consensus                  | Value={v1,v2,v3} | ✓ (1/1) | 13
2   | tla-tcommit                    | RM={rm1,rm2,rm3} | ✓ (2/2) | 1
3   | i4-lock-server                 | Server={s1,s2}, Client={c1,c2} | ✓ (4/4) | 1
4   | ex-quorum-leader-election      | Node={n1,n2,n3,n4} | ✓ (4/4) | 1
5   | pyv-toy-consensus-forall       | Node={n1,n2,n3}, Value={v1,v2} | ✓ (6/6) | 1
6   | tla-simple                     | N=4 | ✓ (4/4) | 1
7   | ex-lockserv-automaton          | Node={n1,n2,n3} | ✓ (45/45) | 6
8   | tla-simpleregular              | N=3 | ✓ (12/12) | 1
9   | pyv-sharded-kv                 | Node={n1,n2,n3}, Key={k1,k2}, Value={v1,v2} | ✓ (18/18) | 15
10  | pyv-lockserv                   | Node={n1,n2,n3} | ✓ (45/45) | 6
11  | tla-twophase                   | RM={rm1,rm2,rm3} | ✗ (68/70) | 18
12  | i4-learning-switch             | TO | |
13  | ex-simple-decentralized-lock   | Node={n1,n2,n3} | ✓ (8/8) | 17
14  | i4-two-phase-commit            | Node={n1,n2,n3} | ✓ (TUT) | 6
15  | pyv-consensus-wo-decide        | Node={n1,n2,n3} | ✗ (35/40) | 20
16  | pyv-consensus-forall           | Node={n1,n2,n3} | ✗ (46/48) | 25
17  | pyv-learning-switch            | TO | |
18  | i4-chord-ring-maintenance      | n/a | |
19  | pyv-sharded-kv-no-lost-keys    | Node={n1,n2}, Key={k1,k2}, Value={v1,v2} | ✓ (6/6) | 12
20  | ex-naive-consensus             | Node={n1,n2,n3}, Value={v1,v2} | ✗ (11/12) | 6
21  | pyv-client-server-ae           | Node={n1,n2,n3}, Request={r1,r2}, Response={p1,p2} | ✓ (6/6) | 2
22  | ex-simple-election             | Acceptor={a1,a2,a3}, Proposer={p1,p2} | ✗ (11/12) | 5
23  | pyv-toy-consensus-epr          | Node={n1,n2,n3}, Value={v1,v2} | ✗ (6/8) | 9
24  | ex-toy-consensus               | Node={n1,n2,n3}, Value={v1,v2} | ✗ (1/4) | 1
25  | pyv-client-server-db-ae        | Node={n1,n2,n3}, Request={r1,r2,r3}, Response={p1,p2,p3}, DbRequestId={i1,i2} | ✓ (40/40) | 20
26  | pyv-hybrid-reliable-broadcast  | n/a | |
27  | pyv-firewall                   | Node={n1,n2,n3} | ✗ (4/10) | 23
28  | ex-majorityset-leader-election | Node={n1,n2,n3} | ✗ (9/12) | 9
29  | pyv-consensus-epr              | Node={n1,n2,n3}, Value={v1,v2} | ✗ (39/40) | 21
30  | mldr                           | MaxTerm=3, MaxConfigVersion=3, Server={n1,n2,n3,n4} | ✗ (15/24) | 226

Abstract—Stateless Model Checking (SMC) is a verification
technique for concurrent programs that checks for safety violations
by exploring all possible thread schedulings. It is highly effective
when coupled with Dynamic Partial Order Reduction (DPOR),
which introduces an equivalence on schedulings and needs to explore
only one in each equivalence class. Even with DPOR, SMC often
spends unnecessary effort in exploring loop iterations that are pure, 
i.e., have no effect on the program state. We present techniques 
for making SMC with DPOR more effective on programs with 
pure loop iterations. The first is a static program analysis to detect 
loop purity and an associated program transformation, called 
Partial Loop Purity Elimination, that inserts assume statements to 
block pure loop iterations. Subsequently, some of these assumes 
are turned into await statements that completely remove many 
assume-blocked executions. Finally, we present an extension of the 
standard DPOR equivalence, obtained by weakening the conflict 
relation between events. All these techniques are incorporated 
into a new DPOR algorithm, OPTIMAL-DPOR-AWAIT, which can
handle both awaits and the weaker conflict relation, is optimal in 
the sense that it explores exactly one execution in each equivalence 
class, and can also diagnose livelocks. Our implementation in 
NIDHUGG shows that these techniques can significantly speed up 
the analysis of concurrent programs that are currently challenging 
for SMC tools, both for exploring their complete set of interleavings
and for detecting concurrency errors in them.


I. INTRODUCTION 


Ensuring correctness of concurrent programs is difficult, 
since one must consider all the different ways in which 
actions of different threads can be interleaved. Stateless model 
checking (SMC) [9] is a fully automatic technique for finding 
concurrency bugs (i.e., defects that arise only under some 
thread schedulings) and for verifying their absence. Given a 
terminating program and fixed input data, SMC systematically 
explores the set of all thread schedulings that are possible 
during program runs. A special runtime scheduler drives the 
SMC exploration by making decisions on scheduling whenever 
such choices may affect the interaction between threads. SMC 
has been implemented in many tools (e.g., VeriSoft [10], 
CHESS [20], Concuerror [6], NIDHUGG [2], rInspect [24], 
CDSCHECKER [21], RCMC [14], and GENMC [18]), and 
successfully applied to realistic programs (e.g., [11] and [17]). 

SMC tools typically employ dynamic partial order reduction 
(DPOR) [8, 1] to reduce the number of explored schedulings. 
DPOR defines an equivalence relation on executions, which 
preserves relevant correctness properties, such as reachability 
of local states and assertion violations. For correctness, DPOR 
needs to explore at least one execution in each equivalence 

    Thread p:                      Thread q:
      if(x[0] > x[1])                do a := y
        swap(x[0], x[1]);            while(a ≠ 1);
      y := 1;                        if(x[1] > x[2])
      do b := y                        swap(x[1], x[2]);
      while(b ≠ 2);                  y := 2
      if(x[0] > x[1])
        swap(x[0], x[1])

Figure 1: A concurrent program implementing a sorting network. p sorts x[0]
and x[1], and then uses y to signal that x[1] is ready. q waits for y to be 1
and then sorts x[1] and x[2], completing one round of bubble sort. In the
second round (shown in blue in the original figure), q signals that the next
"generation" of x[1] is ready by setting y to 2, upon which p finishes the
sort by sorting x[0] and x[1] again. Initially y = 0.


class. We call a DPOR algorithm optimal if it guarantees the 
exploration of exactly one execution per equivalence class. 

In SMC, loops have to be bounded if they do not already 
terminate in a bounded number of iterations. Loop bounding 
may in general not preserve assertion failures. Hence a 
fairly large loop bound should be used, but this is often 
practically infeasible, and thus loop bounding must strike a 
balance between these two concerns. However, for loops whose 
execution has no global effects, the number of equivalence 
classes that need be explored by SMC can be significantly 
reduced while still preserving correctness properties, using 
techniques that we will present in this paper. 

Consider the first round of the program snippet in Fig. 1 
(shown in black), where thread q executes a loop that waits for
thread p to set the shared variable y to 1. A naive application 
of SMC with DPOR will explore an unbounded number of 
executions, since (in the absence of loop bounding) there is an 
infinite number of equivalence classes, one for each number of 
performed loop iterations. All iterations of this loop, however, 
are pure, i.e., they have no effect on the program state. For 
such loops, a bound of one will preserve correctness properties. 
In our example, the do-while loop of thread q can be rewritten 
into the sequence of statements a := y; assume(a = 1), which 
will cause the SMC exploration to permanently block thread q 
whenever the condition of the assume is violated. 

Using assume statements to bound loops forces SMC to explore
executions in which the condition of the assume is violated and
its thread is blocked. This happens even if
the condition will eventually be satisfied, and the original loop 
will exit, under any fair thread scheduling. Assume-blocking 
of a thread can occur in many contexts, each generating an 
execution that need not be explored. (We will shortly see this 
for the example in Fig. 1.) Furthermore, and perhaps more 
seriously, this use of assumes prevents SMC from diagnosing
livelocks in which the loop never exits even under fair thread 
scheduling. This is because a blocked execution corresponding 
to a livelock can also result from a spurious execution in which 
the assume reads a shared variable before it has been written 
to by another thread. 

Here is where await statements can lead to further reductions. 
An await loads from a shared variable, but only if the loaded 
value satisfies some condition, otherwise it blocks. In contrast 
to assume-blocking, await-blocking is not permanent but can be 
repealed if the condition is later satisfied. Thereby, executions 
where blocking occurs by reading “too early” are avoided. 
Moreover, such executions can be distinguished from livelocks, 
in which the condition is not satisfied after some bounded time. 
For our example, the rewrite of the do-while loop into an 
await(y = 1) statement results in a program for which SMC 
would explore only a single execution in which the await reads 
the value written by thread p. 
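
The difference between the two rewrites can be sketched with a tiny explicit enumeration. This is a toy model written for illustration, not NIDHUGG code: it keeps only the signalling part of the first round (p performs y := 1, q waits for y = 1), and the function name `executions` is hypothetical.

```python
def executions(use_await):
    """Enumerate maximal executions of a two-thread toy program.

    use_await=False:  p: y := 1   ||   q: a := y; assume(a = 1)
    use_await=True:   p: y := 1   ||   q: await(y = 1)
    Returns (schedule, final state of q) pairs.
    """
    out = []

    def step(y, p_done, q_state, sched):
        moved = False
        if not p_done:                        # p's write is always enabled
            moved = True
            step(1, True, q_state, sched + ["p"])
        if q_state == "ready":
            if use_await:
                if y == 1:                    # await is enabled only when satisfied
                    moved = True
                    step(y, p_done, "done", sched + ["q"])
            else:                             # the load is always enabled
                moved = True
                nxt = "done" if y == 1 else "blocked"   # assume(a = 1)
                step(y, p_done, nxt, sched + ["q"])
        if not moved:
            out.append(("".join(sched), q_state))

    step(0, False, "ready", [])
    return out

print(executions(use_await=False))
print(executions(use_await=True))
```

With the assume, the schedule in which q loads before p writes survives as a separate, permanently blocked execution; with the await, that schedule is simply not enabled, so a single execution remains.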

Consider now the full program in Fig. 1, which performs a 
concurrent sort of a three-element array using a sorting network. 
This program can be scaled to larger arrays for increased 
available parallelism. Since any network sorting an array of 
size n will have at least $\Omega(n \log n)$ occurrences of a code snippet
which exchanges two values after exiting a spinloop, SMC will
explore $\Omega(2^{n \log n})$ executions of such a program,
even after rewriting the spinloops using assume statements. On 
the other hand, when using await statements, all executions 
fall into the same equivalence class. Thus, an optimal SMC 
algorithm that can properly handle awaits will explore only 
one execution, thereby achieving exponential reduction. 

In this paper, we present techniques to (i) automatically 
transform a program to an intermediate representation that 
uses await as a primitive, and (ii) explore its executions 
using a provably optimal DPOR algorithm that is await aware 
and also uses a conflict relation between statements which 
is weaker than the standard one. We first present a static 
program analysis technique to detect pure loop executions 
and an associated program transformation, called Partial Loop 
Purity (PLP) Elimination, that inserts assume statements which 
are then turned into awaits if preceded by the appropriate 
load. We prove that PLP is sound in the sense that it 
preserves relevant correctness properties, including local state 
reachability and assertion failures. We also present and prove 
conditions under which PLP is guaranteed to remove all pure 
executions of a loop. Finally, we prove that our new DPOR 
algorithm OPTIMAL-DPOR-AWAIT, which is an extension of
the Optimal-DPOR algorithm of Abdulla et al. [1, 3], is correct 
and optimal, also with respect to our weaker conflict relation. 

All these techniques are available in NIDHUGG, a state-of-the- 
art SMC tool, and in the paper’s replication package [13]. Our 
evaluation, using multi-threaded programs which are currently 
challenging for most tools, shows that our techniques can 
achieve significant (and sometimes exponential) reduction 
in the total number of executions that need to be explored. 
Moreover, they enable detection of concurrency bugs which 
were previously out-of-reach for most concurrency testing tools. 


(a) A program with a spinloop.
    p: do a := x while(a ≠ 1); b := y
    q: y := 42; x := 1

(b) p rewritten with assume.
    p: a := x; assume(a = 1); b := y

(c) p rewritten with await.
    p: await(x = 1); b := y

Figure 2: Multi-threaded program illustrating the rewrites; initially, x = y = 0.
For (b) and (c), q is the same as in (a).


II. ILLUSTRATION THROUGH EXAMPLES 


In this section, we illustrate our contributions through
examples. First, in §II-A we show how assume and await
statements are inserted. In §II-B we illustrate how our optimal
DPOR algorithm handles await statements, and in §II-C how
it handles the weaker conflict relation in which atomic
fetch-and-adds on the same variable are not conflicting.

We consider programs consisting of a finite set of threads 
that share a finite set of shared variables (x, y, z). A thread has 
a finite set of local registers (a, b, c), and runs a deterministic 
code, built from expressions, atomic statements, and synchro- 
nisation operations, using standard control flow constructs. 
Atomic statements read or write to shared variables and 
local registers, including atomic read-modify-write operations, 
such as compare-and-swap and fetch-and-add. Synchronisation 
operations include locking a mutex and joining another thread. 
Executions of a program are defined by an interleaving of 
statements. We use sequential consistency in this paper, but 
we note that some weak memory models (e.g., TSO and PSO) 
can be modelled by an interleaving-based semantics, so our 
work can be extended to DPOR algorithms [2] that handle such 
memory models. Our loop transformations introduce await 
statements, that take a conditional expression over a global 
variable as a parameter and come in several forms: simple 
awaits (await(x = 0)), load-await (a := await(x = 0)), 
and exchange-await (a := xchgawait(x = 0, := 1)). These 
operations block until their condition is satisfied. 


A. Introducing Await Statements 


Let us show an example of how loops are transformed by 
introducing assume and await statements. Consider the loop 
in Fig. 2a. There, thread p executes a spinloop, waiting for 
thread q to set the shared variable x. Each iteration of this 
loop, in which the value loaded into a is different from 1, is 
pure, i.e., it does not modify shared variables, nor any local 
register that may be used after the end of the loop. Therefore 
an assume statement is introduced at the point where the thread 
can distinguish pure executions from impure ones, i.e., after 
a has been loaded. The result of such a rewrite is shown in 
Fig. 2b. This program has two traces, one in which the assume 
succeeds, representing the executions in which the original loop 
terminates, and one where thread p gets assume-blocked. The 
latter trace will exist even in the case where the original loop is 
guaranteed to terminate under a fair scheduler. This problem is 
remedied by replacing the load into a and the following assume 
statement by an await with a test on the shared variable from 
which a reads. Such a rewrite results in the program in Fig. 2c. 
In this case, the await statement may permanently block only 

[Figure 3: program listing and execution tree lost in extraction.
Recoverable parts: initially x = y = 0; threads p and q each write
to x and then to y (events p1, p2 and q1, q2); after joining threads
p and q, the program checks assert(|x - y| < 2).]

Figure 3: Program with a correctness assertion, and execution trees with the
first scheduling of the program; nodes show the values of variables x and y.

if the original loop can livelock under fair scheduling. In our 
simple example, the rewritten program has only a single trace, 
since the original loop is guaranteed to terminate and can be 
replaced by the await. Programs with more complex loops 
(e.g., loops that are pure only along a subset of their paths) 
are also handled by our program transformation (§III), but the
loop is not eliminated when assumes or awaits are introduced. 


B. OPTIMAL-DPOR-AWAIT by Example 


DPOR algorithms are based on regarding executions as 
equivalent if they induce the same ordering between executions 
of conflicting statements. The standard conflict relation regards 
two accesses to the same variable as conflicting if at least 
one is a write. We begin by illustrating the Optimal-DPOR 
algorithm [3] on the simple program in Fig. 3. There, two
threads, p and q, write to two shared variables x and y in
sequence. Optimal-DPOR starts by exploring an arbitrary
interleaved execution of the program. Assume it is p1.p2.q1.q2
as shown in Fig. 3 (we will denote executions by sequences of 
thread identifiers, possibly subscripted by sequence numbers). 
Each explored execution is then analysed to find races, i.e., 
pairs of conflicting events that are adjacent in the happens- 
before order induced by the conflict relation. (An event is a 
particular execution step of a thread in an execution.) Our first 
execution contains two races, (p1,q1) and (p2,q2). For each
race, Optimal-DPOR creates a so-called wakeup sequence, i.e., 
a sequence which continues the analysed execution up to the 
first event in a way which reaches the second event instead of 
the first event. For the first race, the wakeup sequence is q1, 
and for the second race, it is p1.q1.q2. The wakeup sequences
are inserted as new branches just before the first event of the 
corresponding race, thereby gradually building a tree consisting 
of the explored executions and added wakeup sequences. The 
execution tree after the first execution is shown in Fig. 3. 

After processing the first execution, Optimal-DPOR then 
picks the leftmost unexplored leaf in the tree, and extends it 
arbitrarily to a full execution, in which races are analysed, etc. 
As the algorithm backtracks, it deletes the nodes it backtracks 
from in the execution tree. The second execution has two 
races, (p1,q1) as well as (p2,q2). However, the corresponding 
wakeup sequences will result in executions that are redundant, 
i.e., equivalent to already inserted ones, so no further insertion 
takes place. The algorithm proceeds in this way until there are 
no more unexplored leaves corresponding to wakeup sequences.
In total, there are four executions explored by Optimal-DPOR, 
corresponding to the four possible final valuations of x and y. 
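
The count of four can be checked by brute force. The sketch below is not Optimal-DPOR itself: it enumerates every interleaving and groups them by the order of their conflicting writes, which is the equivalence that DPOR exploits. The concrete values written (1 by p, 2 by q) are assumptions, since the listing in Fig. 3 is not fully legible.

```python
from itertools import permutations

# Assumed program: p1: x := 1, p2: y := 1, q1: x := 2, q2: y := 2.
EVENTS = {"p1": ("p", "x", 1), "p2": ("p", "y", 1),
          "q1": ("q", "x", 2), "q2": ("q", "y", 2)}

def interleavings():
    """All schedules that respect program order within each thread."""
    for perm in permutations(EVENTS):
        if perm.index("p1") < perm.index("p2") and perm.index("q1") < perm.index("q2"):
            yield perm

def trace_key(schedule):
    """Order of every conflicting pair (writes by different threads to one variable)."""
    key = []
    for i, e in enumerate(schedule):
        for f in schedule[i + 1:]:
            (te, ve, _), (tf, vf, _) = EVENTS[e], EVENTS[f]
            if te != tf and ve == vf:
                key.append((e, f))
    return frozenset(key)

def final_state(schedule):
    """Final values of (x, y) after running the schedule from x = y = 0."""
    mem = {"x": 0, "y": 0}
    for e in schedule:
        _, var, val = EVENTS[e]
        mem[var] = val
    return mem["x"], mem["y"]

classes = {}
for s in interleavings():
    classes.setdefault(trace_key(s), []).append(s)

for members in classes.values():
    print(len(members), "interleaving(s), final (x, y) =", final_state(members[0]))
```

Six interleavings collapse into four classes, one per final valuation, and an optimal algorithm explores one representative of each.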


[Figure 4: execution tree lost in extraction. Program, initially x = y = 0:
    p: x := 1; x := 0
    q: await(x = 0); then a write to y (value not recoverable)]

Figure 4: Exploration of a program with an await with two satisfying writes.


[Figure 5: execution tree lost in extraction. Program:
    p: x +:= 1
    q: x +:= 1
    r: x +:= 3
    s: await(x = 3); y := 1]

Figure 5: Exploration of a program with fetch-and-adds. Initially, x = y = 0.


Let us now look at how OPTIMAL-DPOR-AWAIT extends 
Optimal-DPOR to work for programs with awaits. Consider 
the program in Fig. 4. There, p writes to the global variable x, 
first updating it to 1, and then back to 0. Assume that the first 
execution is p1.p2.q1.q2. The analysis of races performed by 
Optimal-DPOR must now be extended to consider that await 
statements are sometimes blocked. First, the conflict between 
p2 and q1 will not be handled like a race, since q1 is blocked
just before p2. Therefore, we find the closest preceding point
in the execution at which q1 is not blocked, which in this case
is at the beginning. We then construct the wakeup sequence q1
and insert it at the beginning; cf. Fig. 4. Since this program 
only has two traces, OPTIMAL-DPOR-AWAIT will terminate 
after exploring the second execution. 
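
That the program has exactly two traces can be confirmed by enumerating its maximal executions with await-blocking. The sketch below is a brute-force check under assumptions, not the OPTIMAL-DPOR-AWAIT algorithm: the program is taken from the description of Fig. 4, and q2 is modelled as an opaque write to y because its value is not legible.

```python
def maximal_executions():
    """Enumerate schedules of  p: x := 1; x := 0  ||  q: await(x = 0); <write y>."""
    out = []

    def go(x, pc_p, pc_q, sched):
        moved = False
        if pc_p < 2:                                   # p1: x := 1, p2: x := 0
            moved = True
            go(1 if pc_p == 0 else 0, pc_p + 1, pc_q, sched + (f"p{pc_p + 1}",))
        if pc_q == 0 and x == 0:                       # q1 is enabled only if x = 0
            moved = True
            go(x, pc_p, 1, sched + ("q1",))
        elif pc_q == 1:                                # q2 writes y, always enabled
            moved = True
            go(x, pc_p, 2, sched + ("q2",))
        if not moved:
            out.append(sched)

    go(0, 0, 0, ())
    return out

def source_of_await(sched):
    """Which write q1 reads from: 'init' or the last write by p before q1."""
    before = [e for e in sched[:sched.index("q1")] if e.startswith("p")]
    return before[-1] if before else "init"

execs = maximal_executions()
for s in execs:
    print(".".join(s), "-> q1 reads from", source_of_await(s))
```

The four schedules fall into two classes: q1 reads the initial value, or q1 reads the value written by p2. No schedule places q1 between p1 and p2, because the await is disabled there.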


C. Handling Atomic Fetch-and-Add Instructions in DPOR 


To reduce the number of equivalence classes that need be 
explored by a DPOR algorithm, one can weaken the standard 
conflict relation between statements by considering two atomic 
fetch-and-add (FAA) statements on the same variable as non- 
conflicting if the loaded values are afterwards unused. In the 
absence of await statements, many existing DPOR algorithms 
like Optimal-DPOR handle this definition without modification. 
However, this weakening has a subtle interaction with await 
statements that must be handled by OPTIMAL-DPOR-AWAIT.

Consider the program in Fig. 5. In this program, three threads, 
p, q, and r, add atomically to the shared variable x, and a thread 
s awaits x having the value 3. We assume that DPOR considers 
the FAA statements p1, q1, and r1 to be non-conflicting, but
conflicting with the statement s1, should it execute.

Assume that the first explored execution is p1.q1.r1. From
this point, we cannot substitute s1 for either of p1, q1, or r1, as
s1 is not enabled after any of q1.r1, p1.r1, or p1.q1, respectively.
Yet, there is another execution in which s1 is enabled. In
order to construct this execution, we must not only schedule
s1 before one of the other events, but before two, both of p1
and q1, so that only r1 remains. Then, we could construct the
wakeup sequence r1.s1. In general, OPTIMAL-DPOR-AWAIT
may need to reorder the sequence of independent FAAs that
precede an await statement and select a subsequence of them,
in order to unblock the await statement. This can be done
in several ways, and OPTIMAL-DPOR-AWAIT is optimised to
avoid enumerating all of them. In §IV-B, we will see how.
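
The search problem behind this example can be sketched as a naive subset enumeration. This is precisely the exhaustive approach that the algorithm is designed to avoid, shown only to make the example concrete; the increments (1, 1, and 3) are read from the listing in Fig. 5, and the function name `enabling_subsets` is hypothetical.

```python
from itertools import combinations

# Increments as read from Fig. 5: p and q add 1, r adds 3.
FAAS = {"p1": 1, "q1": 1, "r1": 3}
TARGET = 3                      # s1: await(x = 3), with x initially 0

def enabling_subsets(faas, target, init=0):
    """Subsets of FAA events after which the await condition holds.

    The FAAs are mutually non-conflicting, so only the set executed
    before the await matters, not the order within that set.
    """
    names = sorted(faas)
    return [set(c)
            for k in range(len(names) + 1)
            for c in combinations(names, k)
            if init + sum(faas[e] for e in c) == target]

print(enabling_subsets(FAAS, TARGET))
```

Only the subset containing r1 alone enables s1, which matches the wakeup sequence r1.s1 derived in the text.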


III. PARTIAL LOOP PURITY ELIMINATION 


In this section, we describe Partial Loop Purity Elimination, 
a technique that prevents SMC from exploring executions with 
pure loop iterations. It consists of (1) a static analysis technique 
which annotates programs with conditions under which a loop 
will execute a pure iteration, and (2) a program transformation 
which inserts assume statements based on the analysis. 

We consider loops consisting of a set of basic blocks, with 
a single header block. Each basic block contains a sequence of 
program statements. Blocks are connected via edges, labelled 
by conditions. We also consider program representations on 
Static Single Assignment (SSA) form, which means that each 
register is assigned by exactly one statement. Thus, a register 
uniquely identifies the statement that assigns to it. When the 
value of a register in one block depends on which predecessor 
block was executed, this is expressed using a phi node. For 
example, in a block C with predecessors A and B containing 
registers a and b, respectively, the statement c := φ(A:a, B:b)
defines the register c to get the value of a when the previous 
basic block was A and of b when the previous block was B. 

An execution of a loop iteration is pure if the execution starts 
and ends at the header of the loop, and during the iteration 
(i) no modification of a global variable is performed, (ii) nor 
of any local variable that may be used after the end of the 
iteration, and (iii) no internal (not to the header) backedge 
is taken. In SSA form, modification of local variables can be 
inferred from the phi nodes in the header. If such a phi node 
uses a different value on the backedge to the header than when 
first entering, then the loop iteration modified a local variable 
that is used on some path after the iteration, and we call the 
header impure along the backedge. Our definition considers 
executions that complete inner loop iterations to be non-pure. 
However, our PLP transformation will block inner loops from 
completing pure iterations. 

A register a reaches a program point l if all paths to l pass a's
definition. During a loop execution, we say that an expression
over registers is defined-true at some program point l in the
loop, if the expression evaluates to true under (i) the current 
valuation of registers that were assigned either outside the loop 
or during the current loop iteration, and (ii) any valuation of 
all other registers. We now define a central concept; that of 
the Forward Purity Condition. 


Definition 1 (Forward Purity Condition). Let l be a program
point in a loop. Then, a Forward Purity Condition (FPC) at l
is an expression in Disjunctive Normal Form over the registers
such that if an execution, without leaving the loop or taking
an internal backedge, proceeds to a program point l', at which
the expression is defined-true, then

(i) the execution from l' will reach the loop header without
taking an internal backedge, and

(ii) the execution from l to the loop header will not modify
any global variables nor any local variable that may be
used after execution has reached the loop header.


[Figure 6: control-flow graphs lost in extraction. (a) The loop header
executes a := x; b := y and branches on a = 4; the a = 4 branch
executes z := 42; both branches join in a block that takes the
backedge if a > 4. (b) The same loop annotated with the FPCs
[a > 4] and [False], with assume(a ≤ 4) inserted after a := x.]

(a) A loop with non-purity and conditional branches.
(b) The loop annotated with FPCs and with the assume that is inserted.

Figure 6: Program snippet illustrating the concepts of the PLP transformation.


We will denote a FPC with brackets, for example [c > 42] or 
[False]. A purity condition (PC) of a loop is a FPC of the loop 
at the beginning of its header. Thus, whenever a loop iteration 
passes a program point where the PC is defined-true, and has 
not taken an internal backedge, then that iteration is pure. 

We illustrate these concepts for the program snippet in 
Fig. 6a. In it, the loop loads x and y into registers a and b, 
then branches on the value of a, and along the path where 
a = 4, there is a write to z. Since a write to a global variable 
is non-pure, the loop is not pure whenever a = 4. The two 
paths converge in a common block where a loop condition 
(a > 4) is checked. This loop is pure if (i) it takes the backedge, 
i.e., a > 4 holds, and (ii) the write to z is not performed, i.e., 
a ≠ 4 also holds. The conjunction of these conditions, a > 4,
becomes a purity condition for the entire loop. We thereafter 
insert an assume with the negation of a disjunct of the PC at 
the earliest point that it is defined-true, i.e., after the load of x, 
shown in blue in Fig. 6b. 

Let us now describe the analysis stage for computing purity 
conditions. Its first step is to compute FPCs at all points in 
the loop. Intuitively, the FPC at a point l is a disjunction
$c_1 \lor \cdots \lor c_n$, where each $c_i$ is a (forward) path condition for
reaching the header via a pure execution from l. We compute
FPCs by backwards propagation through statements and basic
blocks. Let FPC(s•) be the FPC immediately after statement
s, let FPC(•s) be the FPC immediately before statement s,
let FPC(•B) be the FPC at the beginning of block B, and let
FPC(B•) be the FPC at the end of block B.

For each statement s, we compute FPC(•s) as FPC(s•) ∧ g,
where g is the condition under which s does not update a global
variable. For instance, g is False for stores, True for loads,
a = 0 for an atomic add of form x +:= a, a = b for an atomic
exchange of form b := xchg(x,a), and c = 1 for an atomic 
compare-exchange of form c := cmpxchg(x,a,b). 

FPCs for basic blocks are computed as follows. First, for an 
edge with condition g from a block A in the loop to a block B, 
let FPC(A,B) be the FPC along that edge, defined as follows; 
• if B is outside the loop, then FPC(A,B) = [False],

• if B is the header block, then if B is impure along (A,B),
then FPC(A,B) = [False], otherwise FPC(A,B) = [g],

• if B is inside the loop, then FPC(A,B) = [False] if the
edge from A to B is an internal backedge (A,B), otherwise
FPC(A,B) = [FPC(•B) ∧ g].

We propagate FPCs backwards through basic blocks by 
the above rules for statements. We then compute the FPC 
at the end of a block A with outgoing arcs to $B_1,\ldots,B_n$ as
$FPC(A\bullet) = \bigvee_{i=1}^{n} FPC(A,B_i)$. We can thereafter calculate FPCs
for basic blocks by starting from the edges that leave the loop 
or go back to its header. Cycles in the control flow graph are 
no issue, since the FPC of a backedge (A,B) does not depend 
on B. In Fig. 6b, we can see the FPCs computed by the analysis 
on the example. 
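
The backward propagation can be sketched on the loop of Fig. 6a. This is a simplified model under stated assumptions, not the analysis implemented in NIDHUGG: conditions over register a are represented by the set of sample values that satisfy them (so conjunction is intersection and disjunction is union), the block names H, W, C are hypothetical, and internal backedges and header phi nodes are not modelled.

```python
DOMAIN = frozenset(range(10))          # sample values of register a (assumption)

def cond(pred):
    """Represent a condition over a by the set of sample values satisfying it."""
    return frozenset(v for v in DOMAIN if pred(v))

TRUE, FALSE = DOMAIN, frozenset()

# Guard g under which a statement does not update a global variable.
GUARD = {"load": TRUE, "store": FALSE}

# Loop of Fig. 6a: block -> (statement kinds, [(edge condition, successor)]).
LOOP = {
    "H": (["load", "load"], [(cond(lambda a: a == 4), "W"),
                             (cond(lambda a: a != 4), "C")]),
    "W": (["store"],        [(TRUE, "C")]),
    "C": ([],               [(cond(lambda a: a > 4), "H"),
                             (cond(lambda a: a <= 4), "EXIT")]),
}
HEADER = "H"

def fpc_begin(block):
    """FPC at the beginning of a block, computed by backward propagation."""
    stmts, edges = LOOP[block]
    fpc = FALSE
    for g, succ in edges:                      # FPC at the end of the block
        if succ == HEADER:
            fpc |= g                           # pure backedge to the header: [g]
        elif succ in LOOP:
            fpc |= fpc_begin(succ) & g         # FPC at start of successor AND g
        # edges leaving the loop contribute [False]
    for kind in reversed(stmts):               # FPC(before s) = FPC(after s) AND g
        fpc &= GUARD[kind]
    return fpc

print(sorted(fpc_begin("H")))                  # purity condition of the loop
```

On the sample domain, the purity condition at the header coincides with a > 4, and the block containing the store gets [False], in agreement with the annotations described for Fig. 6b.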

After the analysis, we insert assume statements. Given a
purity condition of form $c_1 \lor c_2 \lor \cdots \lor c_n$, for each $c_i$ we
insert an assume($\lnot c_i$) at the earliest point that is textually
after the definitions of all registers in $c_i$. For registers that do
not reach the insertion location, arbitrary values can be used
when execution does not pass their definitions. Moreover, if
any memory access along the path corresponding to $c_i$ cannot
be statically determined not to segfault, we must not insert the
assume for $c_i$ before that memory access. For this purpose, we
associate an optional "earliest insertion point" with every $c_i$ in
each FPC computed by the analysis. Finally, to exclude paths that
took some internal backedge, a "took internal backedge" boolean
register is introduced, computed by phi-nodes, and included in
the conjunction $c_i$.

Theorem 1, whose proof appears in the extended version [12] 
of this paper, states two essential properties of PLP. These 
properties intuitively say that PLP removes pure executions 
while preserving relevant correctness properties. If σ is a local
state occurring in a loop L of a thread p, we say that L is
unavoidably pure from σ to denote that whenever thread p is
in local state σ during an execution, then p is in the process
of completing a pure iteration of L. 


Theorem 1. Let P’ be the program resulting from applying 
PLP to P. Then P satisfies the following properties. 


1) Local State Preservation: each local state σ of a thread p
which is reachable in P is also reachable in P', provided
no loop of p is unavoidably pure from σ.

2) Pure Loop Elimination: no execution of P’ exhibits a 
completed pure loop iteration of some thread. 


We remark that in the definition of pure loop iterations, 
we assume possibly conservative characterisations of “global 
variable” and “local variable that may be used after the end of 
the iteration” that can be determined by a standard syntactical 
analysis of the program, and hence used in the PLP analysis. 


IV. THE OPTIMAL-DPOR-AWAIT ALGORITHM 


In this section, we present OPTIMAL-DPOR-AWAIT, a
DPOR algorithm for programs with await statements, which
is both correct and optimal. Given a terminating program on
given input, it explores exactly one maximal execution in each
equivalence class induced by the equivalence relation ≃.


A. Happens-Before Ordering and Equivalence 


DPOR algorithms are based on a partial order on the events 
in each execution. Given an execution E of a program P, an 
event of E is a particular execution step by a single thread; the
i'th event by thread p is identified by the tuple (p,i), and $\hat{e}$
denotes the thread p of an event e = (p,i). Let dom(E) denote
the set of events in E. We define a happens-before relation on
dom(E), denoted $\xrightarrow{hb}_E$, as the smallest transitive relation such
that $e \xrightarrow{hb}_E e'$ if e occurs before e' in E, and either


(i) e and e’ are performed by the same thread, e spawns the 
thread which performs e’, or e’ joins the thread which 
performs e, or 

(ii) e and e’ access a common shared variable x, at least 
one of them writes to x, and they are not both atomic 
fetch-and-add operations. 

Note that the last condition makes atomic fetch-and-add 
operations on the same shared variable independent. It follows 
that $\xrightarrow{hb}_E$ is a partial order on dom(E). We define two
executions, E and E', as equivalent, denoted E ≃ E', if they
induce the same happens-before relation on the same set of
events (i.e., dom(E) = dom(E') and $\xrightarrow{hb}_E \,=\, \xrightarrow{hb}_{E'}$). If E ≃ E',
then all variables are modified by the same sequence of 
statements, implying that each thread runs through the same 
sequence of local states in E and E’. 
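
The definition can be sketched directly. The code below is a simplified model for illustration: events are hypothetical (thread, kind, variable) triples, spawn and join edges from condition (i) are omitted, and equivalence is tested by comparing the two happens-before relations on identically labelled events.

```python
from itertools import permutations

def conflicts(e, f):
    """Condition (ii): same variable, at least one write, not both fetch-and-adds."""
    (_, kind_e, var_e), (_, kind_f, var_f) = e, f
    if var_e != var_f:
        return False
    if kind_e == "read" and kind_f == "read":
        return False                           # no write involved
    return not (kind_e == "faa" and kind_f == "faa")

def happens_before(execution):
    """Happens-before relation of an execution, as pairs of (thread, i) event ids."""
    counts, ids = {}, []
    for thread, _, _ in execution:
        counts[thread] = counts.get(thread, 0) + 1
        ids.append((thread, counts[thread]))
    n = len(execution)
    hb = {(i, j) for i in range(n) for j in range(i + 1, n)
          if execution[i][0] == execution[j][0]
          or conflicts(execution[i], execution[j])}
    for k in range(n):                         # transitive closure
        for i in range(n):
            for j in range(n):
                if (i, k) in hb and (k, j) in hb:
                    hb.add((i, j))
    return frozenset((ids[i], ids[j]) for i, j in hb)

def equivalent(e1, e2):
    """Same events and same happens-before relation."""
    return sorted(e1) == sorted(e2) and happens_before(e1) == happens_before(e2)

faas = [("p", "faa", "x"), ("q", "faa", "x"), ("r", "faa", "x")]
writes = [("p", "write", "x"), ("q", "write", "x")]
print(all(equivalent(faas, list(perm)) for perm in permutations(faas)))
print(equivalent(writes, writes[::-1]))
```

All six orderings of the three fetch-and-adds are equivalent under this relation, whereas swapping two plain writes to the same variable yields a different happens-before relation.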


B. The Working of the OPTIMAL-DPOR-AWAIT Algorithm


OPTIMAL-DPOR-AWAIT is shown in Algorithm 1. It 
performs a depth-first exploration of executions using the 
recursive procedure Explore(E), where E is the currently 
explored execution, which can also be interpreted as the stack 
of the depth-first exploration. In addition, for each prefix E’ 
of E, the algorithm maintains 

• a sleep set sleep(E'), i.e., a set of threads that should not
be explored from E', for the reason that each extension of
form E'.p for p ∈ sleep(E') is equivalent to a previously
explored sequence,

• a wakeup tree wut(E'), i.e., an ordered tree ⟨B, ≺⟩, where B
is a prefix-closed set of sequences, whose leaves are called
wakeup sequences, and ≺ is the order in which sequences
were added to wut(E'). For each w ∈ B the sequence E'.w
will be explored during the call Explore(E') in the order
given by ≺.

All previously explored sequences together with the current
wakeup tree (i.e., all sequences of form E'.w for w ∈ wut(E')
and a prefix E' of E) form the current execution tree, denoted $\mathcal{E}$.
The branches of $\mathcal{E}$ are ordered by the order in which they were
added to the tree. Note that the recursive call to Explore(E)
may insert into wut(E') for prefixes E' of E.

Let v \ p denote the sequence v with the first occurrence of
an event by thread p (if any) removed. Let $next_{[E]}(p)$ denote
the next event performed by thread p after E. Two important
concepts are races and weak initials.

Definition 2 (Non-Blocking Races). Let e, e' be two events
in different threads in an execution E, where e occurs before
e'. Then e and e' are in a non-blocking race, denoted $e \lessdot_E e'$,
if (i) e and e' are adjacent in $\xrightarrow{hb}_E$ (i.e., $e \xrightarrow{hb}_E e'$, and for no
other event e'' we have $e \xrightarrow{hb}_E e'' \xrightarrow{hb}_E e'$), and (ii) e' cannot
be enabled or disabled by an event in another thread.


Definition 3 (Weak Initials). For an execution E.w, the set
of weak initials of w (after E), denoted $WI_{[E]}(w)$, is the set
of threads p such that E.w ~ E.p.(w \ p) if p is in w, and
E.w.p ~ E.p.w if p is not in w.


Intuitively, $p \in WI_{[E]}(w)$ if $next_{[E]}(p)$ is independent with all
events that precede it in w in the case that p is in w, otherwise
with all events in w. If $p \in WI_{[E]}(w)$ we say that w is redundant
wrt. E.p, since some extension of E.w is equivalent to some
extension of E.p. An important property of the execution tree
$\mathcal{E}$ that is maintained by the algorithm is that an extension w
of an existing sequence E is added only if $\mathcal{E}$ does not contain
an execution of form E'.p such that E' but not E'.p is a prefix
of E, and w'.w is redundant wrt. E'.p, where w' is defined by
E = E'.w'.
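The intuitive characterisation of weak initials can be sketched in Python as follows. This is a simplified illustration, not the NIDHUGG implementation: events are encoded as hypothetical `(thread, kind, variable)` triples, the dependence test is a plain read/write conflict check, and only threads that occur in w are considered.

```python
def dependent(a, b):
    """Same thread, or conflicting accesses to the same variable (assumed model)."""
    if a[0] == b[0]:
        return True
    return a[2] == b[2] and "write" in (a[1], b[1])

def weak_initials_in(w):
    """Threads occurring in w whose first event is independent of all events before it in w."""
    result, seen = set(), set()
    for k, ev in enumerate(w):
        thread = ev[0]
        if thread in seen:
            continue  # only the first event of each thread matters
        seen.add(thread)
        if not any(dependent(prev, ev) for prev in w[:k]):
            result.add(thread)
    return result

# q's write comes first, r's read depends on it, p's write is independent of both.
w = [("q", "write", "y"), ("r", "read", "y"), ("p", "write", "x")]
print(weak_initials_in(w))
```

Checking only direct dependence suffices here, because any happens-before chain ending in an event must end with a direct dependence on some earlier event of w.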

For the OPTIMAL-DPOR-AWAIT algorithm, we define

• pre(E, e) as the prefix of E up to but not including e,

• notdep(e, E) as the subsequence of E of events that occur
after e but do not happen-after e,

• $u \sqsubseteq_{[E]} w$ to denote that E.u.v ~ E.w for some v; intuitively
u is a “happens-before prefix” of w.
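As a rough illustration of pre and notdep, the following Python sketch computes both for a small execution and builds the wakeup sequence notdep(e, E).e' used for a non-blocking race. The `(thread, kind, variable)` event encoding, the index-based interface, and the dependence test are assumptions of this sketch.

```python
def dependent(a, b):
    """Same thread, or conflicting accesses to the same variable (assumed model)."""
    if a[0] == b[0]:
        return True
    return a[2] == b[2] and "write" in (a[1], b[1])

def happens_after(execution, k):
    """Indices of events that happen-after execution[k]."""
    after = {k}
    for j in range(k + 1, len(execution)):
        # j happens-after k if it depends on k or on an event already after k
        if any(dependent(execution[i], execution[j]) for i in after):
            after.add(j)
    after.discard(k)
    return after

def pre(execution, k):
    """Prefix of the execution up to but not including event k."""
    return execution[:k]

def notdep(execution, k):
    """Events that occur after event k but do not happen-after it."""
    after = happens_after(execution, k)
    return [execution[j] for j in range(k + 1, len(execution)) if j not in after]

E = [("p", "write", "x"), ("q", "write", "y"), ("q", "read", "x"), ("r", "read", "y")]
e, e_prime = 0, 2                    # p's write races with q's read of x
v = notdep(E, e) + [E[e_prime]]      # wakeup sequence notdep(e, E).e'
print(pre(E, e), v)
```

In this example the wakeup sequence schedules q's write to y and r's read of y before q's read of x, so that the read now precedes p's write.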


The algorithm runs in two phases: race detection (lines 3–22)
and exploration (lines 24–33). Exploration picks the next
unexplored leaf of the exploration tree and extends it with
arbitrary scheduling to a maximal execution. This leaf is
reached step-by-step: at each step, the current execution E is
extended by the leftmost child of the root of wut(E) and used
in a recursive call to Explore (lines 28–31) in order to perform
the next step. If wut(E) only contains the empty sequence,
an arbitrary thread is chosen for the next step and added to
wut(E) (line 26). This step-by-step extension of the current
execution is continued until a maximal execution is reached.
At each step, the new sleep set after E.p is constructed by
taking the elements of sleep(E) that are independent with p.
After a recursive call to E.p, the subtree rooted at E.p can be
removed from the wakeup tree. To remember that we should
not attempt to explore any sequences that are redundant wrt.
E.p, we add p to sleep(E).

The race detection phase is entered when the explored
sequence E is maximal. There we examine E for races
and construct new non-redundant executions. We distinguish
between two types of races: non-blocking races, such as
between a write and a read, handled on lines 3–6, and blocking
races, such as those involving an await event, handled on lines 7–22.


For each non-blocking race $e \lessdot_E e'$, we let E' be the prefix
of E that precedes e, and construct a wakeup sequence v by
appending e' to the subsequence of events that occur after e
in E but do not happen-after e (line 5). By construction, the
sequence E'.v is an execution. Moreover, $\hat{e} \notin WI_{[E']}(v)$ since
the occurrence of e' in v does not happen-after e. Thus, v is
non-redundant wrt. $E'.\hat{e}$. If v is also non-redundant wrt. E'.p
for each p ∈ sleep(E'), then v is inserted into the wakeup tree
at E', extending wut(E') with a new leaf if necessary.


Algorithm 1: OPTIMAL-DPOR-AWAIT
Initial call: Explore(⟨⟩) with wut(⟨⟩) = ⟨{⟨⟩}, ∅⟩, sleep(⟨⟩) = ∅
 1  Explore(E)
 2    if enabled(E) = ∅ then
 3      foreach e, e' ∈ dom(E) such that ($e \lessdot_E e'$) do
 4        let E' = pre(E, e)
 5        let v = (notdep(e, E).e')
 6        if sleep(E') ∩ $WI_{[E']}(v)$ = ∅ then insert(v, E')
 7      foreach (e', E') ∈ ({($next_{[E]}(p)$, E) | p is blocked after E}
 8                          ∪ {(e', pre(E, e')) | e' is in E and may block}) do
 9        can-stop := False
10        foreach e in E' (starting from the end)
11            that may enable or disable e' do
12          let E'' = pre(E, e)
13          let w = notdep(e, E)
14          if e conflicts with all events that may
15              enable or disable e' then can-stop := True
16          did-insert := False
17          foreach maximal subsequence u of w such that
18              $u \sqsubseteq_{[E'']} w$ and e' is enabled after E''.u do
19            did-insert := True
20            let v = u.e'
21            if sleep(E'') ∩ $WI_{[E'']}(v)$ = ∅ then insert(v, E'')
22          if can-stop and did-insert then break
23    else
24      if wut(E) = ⟨{⟨⟩}, ∅⟩ then
25        choose p ∈ enabled(E)
26        wut(E) := ⟨{p}, ∅⟩
27      while ∃p ∈ wut(E) do
28        let p = $\min_{\prec}$ {p ∈ wut(E)}
29        sleep(E.p) := {q ∈ sleep(E) | p, q independent after E}
30        wut(E.p) := subtree(wut(E), p)
31        Explore(E.p)
32        add p to sleep(E)
33        remove all sequences of form p.w from wut(E)
34  insert(v, E')
35    u := ⟨⟩
36    let c be the list of children of u in wut(E') from left to right
37    foreach sequence u.p in c do
38      if p ∈ $WI_{[E'.u]}(v)$ then
39        if p ∉ v or (v := v \ p) = ⟨⟩ then return
40        u := u.p
41        if u is a leaf of wut(E') then return
42        goto line 36
43    add v as a new rightmost descendant of u in wut(E')
44    return


Races involving events that can be blocked are handled
at lines 7–22. For each such event e', we extract the prefix
E' that precedes e'. Then, for each e in E' that potentially
conflicts with e', we extract the prefix E'' preceding e and
the sequence w of events that do not happen-after e. For
each maximal happens-before prefix u of w after which e' is
enabled, we construct a wakeup sequence v as u.e' (line 20),
which is checked for redundancy and possibly inserted into
the wakeup tree in the same way as for a non-blocking race.
Such prefixes can be enumerated by recursively removing the
suffix of one event that may enable or disable e' at a time,
stopping whenever e' is enabled by the current prefix. As an
optimisation, implemented by the flags can-stop and did-insert,
once the algorithm has found a wakeup sequence that enables
e' before some event that conflicts with every event that may
enable or disable e', it need not consider reversing e' with
even earlier events e, as those reversals will be considered in
a later recursive call.

The function insert(v, E') for inserting a sequence v into a
wakeup tree wut(E') is shown in lines 34–44. Starting from the
root, represented by the empty sequence, it traverses wut(E')
downwards (the current point being u), always descending
(line 40) to the leftmost child u.p such that p is a weak initial
of the remainder of v until either (i) arriving at a leaf, indicating
that v was redundant to begin with and wut(E') can be left
unchanged (line 41), (ii) encountering a p which is not in v, or
exhausting v (line 39), or (iii) arriving at a node with no child
passing the test at line 38, and then adding the remainder of v
as a new leaf (line 43), since it was shown to be non-redundant.
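The traversal described above can be sketched in Python. This is a simplified illustration under stated assumptions, not the NIDHUGG code: tree nodes are labelled with hypothetical `(thread, kind, variable)` events, the weak-initial test uses a plain read/write dependence check, and the function returns a Boolean (not part of Algorithm 1) to signal whether a new branch was added.

```python
class Node:
    """Wakeup tree node; children are kept in insertion (left-to-right) order."""
    def __init__(self):
        self.children = []  # list of (event, Node)

def dependent(a, b):
    """Same thread, or conflicting accesses to the same variable (assumed model)."""
    if a[0] == b[0]:
        return True
    return a[2] == b[2] and "write" in (a[1], b[1])

def is_weak_initial(ev, v):
    """Is the thread of child event ev a weak initial of the sequence v?"""
    for k, other in enumerate(v):
        if other[0] == ev[0]:  # thread occurs in v: check its first event
            return not any(dependent(prev, other) for prev in v[:k])
    return not any(dependent(other, ev) for other in v)  # thread not in v

def remove_first(v, thread):
    """v \\ p: drop the first event of the given thread."""
    for k, other in enumerate(v):
        if other[0] == thread:
            return v[:k] + v[k + 1:]
    return v

def insert(v, root):
    """Insert v below root; return True iff a new branch was added."""
    u, v = root, list(v)
    while True:
        for ev, child in u.children:           # lines 36-37
            if is_weak_initial(ev, v):         # line 38
                if all(o[0] != ev[0] for o in v):
                    return False               # line 39: p not in v
                v = remove_first(v, ev[0])
                if not v:
                    return False               # line 39: v exhausted
                u = child                      # line 40
                if not u.children:
                    return False               # line 41: reached a leaf
                break                          # line 42: restart at new u
        else:
            for ev in v:                       # line 43: new rightmost branch
                node = Node()
                u.children.append((ev, node))
                u = node
            return True
```

For example, inserting a sequence whose first event conflicts with every existing child of the root creates a new rightmost branch, while re-inserting a prefix of an existing branch leaves the tree unchanged.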

Algorithm OPTIMAL-DPOR-AWAIT is correct and optimal
in the sense that it explores exactly one maximal execution
in each equivalence class, as stated in the following theorem
whose proof is in the extended version of this paper [12].


Theorem 2. For a terminating program P, 
OPTIMAL-DPOR-AWAIT has the properties that (i) for 
each maximal execution E of P, it explores some execution 
E' with E' ~ E, and (ii) it never explores two different but 
equivalent maximal executions. 


V. IMPLEMENTATION AND EVALUATION 


We have implemented our techniques on top of the NIDHUGG 
tool. NIDHUGG is a state-of-the-art stateless model checker for 
C/C++ programs with Pthreads, which works at the level of 
LLVM Intermediate Representation (IR), typically produced 
by the Clang compiler. We have added our PLP analysis 
and transformations, as well as the rewrite from load-assume, 
exchange-assume, and compare-exchange-assume pairs into 
load-await and exchange-await, as passes over LLVM IR. 
NIDHUGG comes with a selection of SMC algorithms. One of 
them is Optimal-DPOR, which we have used as a basis for our 
implementation of OPTIMAL-DPOR-AWAIT including IFAA,
the optimisation of treating fetch-and-add instructions to the 
same memory location as independent. All the techniques in 
this paper are now included in upstream NIDHUGG and are 
enabled when giving the -optimal flag. 


A. Overall Performance 


First, we evaluate our technique and compare its performance 
against baseline NIDHUGG and the SAVER [16] technique, 
implemented in a recent version of GENMC [18]. SAVER has a 
similar goal to our PLP transformation, but tries to identify pure 
loop iterations dynamically, aborting threads if they perform a 
pure loop iteration. SAVER’s approach does not allow further 
rewrite with awaits. 

For our evaluation, we used a set of real-world benchmarks
similar to those used by the SAVER [16] paper. We note that
all atomic memory accesses in these benchmarks have been
converted to SC, as this is the only common memory model
that both tools support. Where relevant, benchmarks are run
with the same loop bound as in the SAVER paper. For most
benchmarks, this is one greater than the number of threads.
After the benchmark name, the number of threads is shown in
parentheses. Benchmarks mcslock, qspinlock and seqlock are
tests of data structures from the Linux kernel. Benchmarks
ttaslock and twalock are mockups based on, but not the same as,
the benchmarks in the SAVER paper, because its authors were
not at liberty to share the original benchmark sources. Both
are tests of locking algorithms. Benchmark mpmc-queue tests
a multiproducer-multiconsumer queue algorithm, linuxrwlocks
tests a readers-writers lock algorithm, treiber-stack tests a
lock-free stack algorithm, and ms-queue tests a lock-free
queue. Benchmarks mutex and mutex-musl test two mutex
algorithms, the second one used in the musl C standard library
implementation. Benchmark sortnet is an extended version
of the concurrent sort program from Fig. 1. In this version,
the sorting networks are generated using Batcher's odd-even
mergesort. The number of elements sorted is twice the number
of threads, so sortnet(6) sorts 12 elements. In our replication
package [13], all the tools and benchmarks are provided, as
well as scripts that can replicate the tables in this section.

We evaluate all techniques based on the number of executions 
they explore. In fact, we show this number using an addition 
of form T +B, where T is the number of explored completed 
executions and B is the number of executions that are blocked 
in the sense that either an await is deadlocked or some thread 
is blocked for executing assume(false) (in NIDHUGG) or a 
pure loop iteration (in SAVER). We remark that the SAVER 
paper reports only the T part, but, as we will see, often the 
number of blocked executions is significant and outnumbers 
the number of explored completed executions. Obviously, both 
numbers contribute to the time an SMC tool takes to explore 
these programs. The evaluation was performed on a Ryzen 
5950X running a July 2022 Arch Linux system. 

In Table I, there are four sets of NIDHUGG columns. Baseline
shows the performance of unmodified NIDHUGG/Optimal. The
PLP columns show the performance of using unmodified
NIDHUGG/Optimal together with Partial Loop Purity Elimination.
Pure loops are bounded with assumes. The PLP+Await columns
show the result of PLP and transforming assumes into awaits,
where possible. Finally, the ...+IFAA columns report results
from when OPTIMAL-DPOR-AWAIT treats atomic fetch-and-add
operations as independent. For the two sets of GENMC
columns, the SAVER columns show the performance of GENMC
v0.6, which implements the SAVER technique, and the Baseline
columns show the performance of GENMC v0.5.3, which does
not. The timeout we have used for these benchmarks is 1 hour.

Starting at the top of Table I, qspinlock is a benchmark that
benefits from neither SAVER nor PLP, but establishes that the
baseline algorithms of both tools are very similar, although GENMC
is faster. In the next four benchmarks (mcslock, twalock, mutex,
and mutex-musl), both PLP and SAVER are ineffective, but
awaits eliminate most of the blocked traces (in mcslock) or
all of them (in the remaining three). Moreover, we see that
IFAA is effective in mutex and mutex-musl, and manages to
almost halve the total number of executions explored.


Table I: Number of (complete+blocked) executions explored by algorithms implemented in GENMC and NIDHUGG on a set of challenging benchmarks, as
well as the execution time (in seconds) taken. The entry t/o means that the exploration did not finish in 1h, and † means that the tool crashed.

                 GENMC                                               NIDHUGG
                 Baseline                  SAVER                     Baseline                  PLP                       PLP+Await                 ...+IFAA
Benchmark        Execs            Time     Execs            Time     Execs            Time     Execs            Time     Execs            Time     Execs            Time
qspinlock(2)     6+2              0.02     6+2              0.02     6+2              0.06     6+2              0.08     6+2              0.08     6+2              0.09
qspinlock(3)     564+462          0.06     564+462          0.06     564+462          0.20     564+462          0.20     564+456          0.21     564+456          0.20
mcslock(3)       336+426          0.09     336+426          0.09     336+426          0.20     336+426          0.23     336+72           0.18     336+72           0.18
mcslock(4)       26232+33432      42.06    26232+33432      3.95     26232+33432      16.59    26232+33432      16.95    26232+4824       9.53     26232+4824       9.43
twalock(3)       96+90            0.02     96+90            0.02     96+90            0.09     96+90            0.09     96               0.08     96               0.08
twalock(4)       6144+7224        0.35     6144+7224        0.36     6144+7224        1.40     6144+7224        1.45     6144             0.80     6144             0.81
mutex-musl(2)    20+2             0.02     20+2             0.01     20+2             0.07     20+2             0.07     20               0.06     20               0.07
mutex-musl(3)    136728+12834     4.74     136728+12834     5.03     25146+93000      11.89    25146+93000      12.04    25146+81972      10.90    14736+36846      5.29
mutex(2)         12+2             0.02     12+2             0.02     12+2             0.07     12+2             0.07     12               0.07     10               0.07
mutex(3)         9486+1236        0.35     6582+1188        0.25     9486+1236        1.07     6582+1188        0.84     6582+336         0.76     3618+312         0.44
ms-queue(3)      925+350          0.13     75+284           0.06     901+374          0.58     901+374          0.58     901+374          0.59     901+374          0.60
ms-queue(4)      11696504+8399226 2388.57  10662+192438     18.35    t/o              t/o      t/o              t/o      t/o              t/o      t/o              t/o
linuxrwlocks(3)  38033+31993      3.03     24+59            0.02     38033+31993      6.95     38033+31993      7.24     38033            4.36     3840             0.54
linuxrwlocks(4)  t/o              t/o      1060+5518        0.22     t/o              t/o      t/o              t/o      t/o              t/o      t/o              t/o
ttaslock(3)      162+183          0.02     162+183          0.03     162+183          0.10     36+81            0.08     36               0.07     36               0.07
ttaslock(4)      20760+29440      1.34     20760+29440      1.46     20760+29440      4.94     576+2308         0.30     576              0.15     576              0.15
seqlock(3)       147+230          0.04     9+83             0.02     147+230          0.14     9+83             0.10     9+36             0.08     9+36             0.09
seqlock(4)       87980+105123     19.68    88+2805          0.17     87980+104583     41.58    88+2769          0.44     88+729           0.20     88+729           0.20
mpmc-queue(3)    11206+11612      1.35     166+987          0.09     11206+8188       3.35     166+840          0.24     166+517          0.20     76+421           0.17
mpmc-queue(4)    t/o              t/o      39706+1277783    87.18    t/o              t/o      39706+1123234    226.45   39706+360426     88.29    5410+114208      24.15
treiber-stack(3) 426              0.04     274+80           0.04     426              0.16     274+80           0.14     274+60           0.15     274+60           0.15
treiber-stack(4) 1546168+9216     217.44   250088+167916    33.17    1546168+9216     403.58   250088+167916    98.24    250088+90896     87.92    250088+90896     88.20
sortnet(4)       †                †        1+728            0.33     1+312            0.48     1+312            0.45     1                0.08     1                0.08
sortnet(5)       †                †        1+15231          10.87    1+4517           9.38     1+4517           9.47     1                0.08     1                0.08
sortnet(6)       †                †        1+163292         140.83   1+38285          100.18   1+38285          98.82    1                0.08     1                0.08
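The observation that IFAA almost halves the total number of explored executions can be checked directly from the mutex(3) and mutex-musl(3) rows of Table I. The short Python sketch below sums the T+B entries of the PLP+Await and ...+IFAA columns; the helper name `total` is an assumption of this illustration.

```python
def total(cell):
    """Total executions for a Table I entry of form 'T+B' (or just 'T')."""
    return sum(int(part) for part in cell.split("+"))

# (PLP+Await, ...+IFAA) entries copied from Table I
rows = {
    "mutex(3)": ("6582+336", "3618+312"),
    "mutex-musl(3)": ("25146+81972", "14736+36846"),
}

ratios = {}
for name, (without_ifaa, with_ifaa) in rows.items():
    ratios[name] = total(with_ifaa) / total(without_ifaa)
    print(f"{name}: {total(without_ifaa)} -> {total(with_ifaa)} "
          f"(ratio {ratios[name]:.2f})")
```

With these numbers, mutex(3) drops from 6918 to 3930 executions and mutex-musl(3) from 107118 to 51582, i.e., to roughly 57% and 48% of the PLP+Await totals.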

PLP fails to identify the loop purity in ms-queue. The
restriction on the form of purity conditions imposed by our
implementation in NIDHUGG underapproximates the purity
condition to [False]. This demonstrates a downside of doing
purity analysis statically, as SAVER never needs to represent
purity conditions in order to eliminate pure loop iterations.

In linuxrwlocks, PLP is ineffective, because this benchmark
does not contain pure loop iterations as we have defined them.
Rather, the loop contains a pair of fetch-and-add and fetch-and-sub
that cancel out, which is called a “zero-net-effect” loop
in the SAVER paper [16]. These are out of scope for a static
analysis, as SAVER has to dynamically undo the elimination if
a read appears to have observed the intermediate effect. Despite
the lack of PLP, OPTIMAL-DPOR-AWAIT significantly speeds
up linuxrwlocks.

In ttaslock, we believe some implementation issue is preventing
SAVER from eliminating pure loop iterations. PLP does
work, however, and awaits eliminate all the blocked executions.

In the next three benchmarks (seqlock, mpmc-queue and 
treiber-stack), PLP discovers the same pure loop iterations 
as SAVER, and permits a rewrite to awaits that significantly 
reduces the search space, even by an order of magnitude for 
seqlock, and on mpmc-queue IFAA further halves it. 

Finally, OPTIMAL-DPOR-AWAIT really shines on sortnet.
GENMC cannot take advantage of awaits, and so has to explore
an exponential number of (assume-blocked) traces, whereas
NIDHUGG can explore the program in just one. Unfortunately,
GENMC v0.5.3 crashes on this benchmark, but we believe it
would yield the same numbers as SAVER, which also explores
a significant number of redundant executions.


B. Effectiveness on SafeStack 


Next, we evaluate the ability of OPTIMAL-DPOR-AWAIT 
to expose difficult-to-find bugs in real-world code bases. 
The benchmark we will use is called safestack. It was first 
posted to the CHESS forum, and subsequently included in 
the SCTBench [23] and SVComp benchmark suites. The 
original safestack code attempts to implement a lock-free 
stack but contains an ABA bug which is quite challenging 
for concurrency testing and SMC tools to find, in the sense 
that exposing the bug requires at least five context switches. 
The test harness is also quite big, containing three threads 
each performing four operations on the stack. Let us refer 
to this original harness as safestack-444 to indicate that each 
of its three threads performs four operations (pop, push, pop, 
push). We will also use shortened versions of this harness: four 
versions with just two threads, and four versions where each 
of the three threads performs fewer operations. The smallest 
harness that exposes the bug is safestack-331. 

We first compare the two SMC tools and their algorithms on
versions of safestack that do not exhibit the bug and thus require
exhaustive exploration of all traces. Table II shows the results.
First, notice that the dynamic technique that SAVER implements
is completely or mostly ineffective in these programs; compare
it to the baseline numbers. In contrast, PLP achieves a significant
reduction of the set of executions that NIDHUGG explores.
Finally, both the transformation of assumes to awaits and
the IFAA optimisation are applicable and result in further
reductions in the number of explored executions. The number
of complete traces is 0 on safestack-311 since the code does
not allow popping the last element, so all traces end up with
one thread livelocking in pop with the queue containing only
one element. For Table II, the timeout used is 10 hours.


Table II: Number of (complete+blocked) executions that SMC algorithms in GENMC and NIDHUGG explore on shortened, bug-free versions of safestack. The entry t/o marks a timeout.

                  GENMC                             NIDHUGG
Benchmark         Baseline         SAVER            Baseline         PLP              PLP+Await        ...+IFAA
safestack-21(2)   119+6            119+6            119+6            34+2             34+1             19+1
safestack-31(2)   928+107          928+107          928+107          103+27           103+25           56+25
safestack-32(2)   7189+296         7189+296         7189+296         1073+27          1073+12          463+12
safestack-33(2)   121334+12652     121334+12652     121334+12652     6434+1636        6434+1584        2600+1160
safestack-211(3)  1267120+325932   995224+325932    1259280+324382   2690+1126        2690+928         962+686
safestack-311(3)  0+286818740      0+275399108      t/o              0+26536          0+24078          0+14960
safestack-321(3)  t/o              t/o              t/o              906529+388117    906529+331337    288057+216830

With our next and last experiment, using safestack-331, we
can evaluate the tools' abilities to expose the bug. Neither
GENMC, with or without SAVER, nor baseline NIDHUGG finds
anything after running for more than 2000 hours! On the
other hand, if we run NIDHUGG with PLP, awaits, and IFAA, it
discovers the bug in just 8 minutes, after exploring 2+2453474
traces. How much of its search space an SMC tool has to search
before it encounters a bug can be up to “luck”, so to ensure that
this result is not due to luck we “fix” the bug by commenting
out all the assertions in the benchmark and run NIDHUGG
again. This gives us an upper bound on the size of the search
space, i.e., how much would need to be searched to find the bug
in the worst case, and also provides an indication of how long
it might take to verify the program after fixing the bug. On the
fixed safestack-331, NIDHUGG terminates in only 24 minutes
after exploring 5772+8521721 traces. This demonstrates how
the techniques we presented in this paper substantially reduce
the search space on safestack, allowing the bug to be found
or its absence verified by an exhaustive SMC technique. To
our knowledge, no other exhaustive technique has ever been
able to discover the bug in safestack.


VI. RELATED WORK 


Since SMC tools assume the analysed program to terminate,
they must first bound unbounded loops. Several tools [2, 21, 14,
15] have an automatic loop unroller that is parameterised by a
chosen loop bound. Several SMC tools, including NIDHUGG [2],
RCMC [14] and GENMC [15], transform simple forms of
spinloops, such as the one shown in Fig. 2a, to assume
statements, but only transform simple polling loops that can
be recognised syntactically. We are not aware of any tool that
transforms loops into await statements, meaning existing tools
are susceptible to scalability problems for programs like the
sorting networks shown in Fig. 1. An SMC technique that
can diagnose livelocks of spinloops under fair scheduling is
VSYNC [22]. However, to do so it enforces fairness, and cannot
bound the loop even with an assume, thus exploring many more
traces than tools which transform spinloops to assumes.

SAVER [16] also aims to block pure loop iterations by 
introducing assume statements. It identifies pure loop iterations 
dynamically, instead of by static analysis as in our approach. 
SAVER's approach can detect a larger class of pure loop
iterations, but it does not allow further rewriting with awaits.
Furthermore, our PLP transformation can block a looping thread 
at any point in the loop, not just at the back edge. SAVER also 
employs several smaller program transformations, such as loop 
rotation and merging of bisimilar control flow graph nodes, 
that can increase the number of loops that may qualify as pure. 
These transformations are orthogonal to the detection of pure 
loop iterations, and could also be used in our framework. 

Checking for purity of loop iterations is an idea that has 
appeared in other contexts, such as to verify atomicity for 
concurrent data structures [7, 19] and to reduce complexity for 
model checking them (e.g., [4]). 

The Optimal-DPOR algorithm implemented in NIDHUGG
handles mutex locks but not await statements. In the journal
article on the Optimal-DPOR algorithm [3], principles
for handling other blocking statements are presented. Our
OPTIMAL-DPOR-AWAIT develops these principles into a
practical and efficient algorithm, which we have also implemented
in NIDHUGG. As future work, the Optimal-DPOR with
Observers [5] algorithm, which allows two statements to only
conflict in the presence of a third event, could also be extended
(potentially at higher cost) to handle awaits.


VII. CONCLUDING REMARKS 


We have presented techniques for making SMC with DPOR
more effective on loops that perform pure iterations, including a
static program analysis technique to detect pure loop executions,
a program transformation to block and also remove them, a
weakening of the standard conflict relation, and an optimal
DPOR algorithm which handles the concepts thus introduced.
We have implemented the techniques in NIDHUGG, showing
that they can significantly speed up the analysis of concurrent
programs with pure loops, and also detect concurrency errors.


Abstract—Automating string transformations has been a driving
application of program synthesis. Existing synthesizers that
solve this problem produce programs in domain-specific languages
(DSL) that are designed to simplify synthesis and therefore
lack nice formal properties. This limitation prevents the synthesized
programs from being used in verification applications (e.g.,
to check complex pre-post conditions) and makes the synthesizers
hard to modify due to their reliance on the given DSL.

We present a constraint-based approach to synthesizing transducers,
a model with strong closure and decidability properties.
Our approach handles three types of specifications: input-output
(i) examples, (ii) types expressed as regular languages, and
(iii) distances that bound how many characters the transducer
can modify when processing an input string. Our work is the first
to support such complex specifications and it does so by using
the algorithmic properties of transducers to generate constraints
that can be solved using off-the-shelf SMT solvers. Our synthesis
approach can be extended to many transducer models and it can
be used, thanks to closure properties of transducers, to compute
repairs for partially correct transducers.


I. INTRODUCTION 


String transformations are used in data transformations [1], 
sanitization of untrusted inputs [2], [3], and many other 
domains [4]. Because in these domains bugs may cause serious 
security vulnerabilities [2], there has been increased interest 
in building tools that can help programmers verify [2], [3] and 
synthesize [1], [5], [6] string transformations. 

Techniques for verifying string transformations rely on
automata-theoretic approaches that provide powerful decidability
properties [2]. On the other hand, techniques for
synthesizing string transformations rely on domain-specific
languages (DSLs) [1], [5]. These DSLs are designed to make
synthesis practical and have to give up the closure and
decidability properties enabled by automata-theoretic models.
The disconnect between the two approaches raises a natural
question: Can one synthesize automata-based models and
therefore retain and leverage their elegant properties?

A finite state transducer (FT) is an automaton where each
transition reads an input character and outputs a string of
output characters. For instance, Figure 1 shows a transducer
that ‘escapes’ instances of the " character. So, on input
a"\"a, the transducer outputs the string a\"\\"a. FTs have
found wide adoption in a variety of domains [3], [7] because
of their many desirable properties (e.g., decidable equivalence
check and closure under composition [8]). There has been
increasing work on building SMT solvers for strings that
support transducers; the Ostrich tool [9] allows a user to write
programs in SMT where string transformations are modelled
using transducers. One can then write constraints over such
programs and use an SMT solver to automatically check for
satisfiability or prove unsatisfiability of those constraints. For
example, given a program like the following:


    y = escapeQuotes(x)
    z = escapeQuotes(y)
    assert(y==z) //Checking idempotence


one can use Ostrich to write a set of constraints and use them
to prove whether the assertion holds. To do so, however, one
first needs to write a transducer T that implements the function
escapeQuotes. Writing transducers by hand is a
cumbersome and error-prone task, and what we present in this
paper is an approach for synthesizing such transducers.


(a) Transducer EscapeQuotes


Examples: {a"a ↦ a\"a, a\\a ↦ a\\a, a\a ↦ a\a, a'a ↦ a'a, \ ↦ \}

Types: [a"]*\?|([a"]*\[a"\][a"]*)* → a*\?|(a*\[a"\]a*)*
Distance: At most 1 edit per input character


(b) Specification to synthesize EscapeQuotes


Fig. 1: Simplified version of EscapeQuotes from [2].

In this paper, we present a technique for synthesizing
transducers from high-level specifications. We use three different
specification mechanisms to quickly yield desirable
transducers: input-output examples, input-output types, and
input-output distances. When provided with the specification
in Figure 1b, our approach yields the transducer in Figure 1a.
While none of the three specification mechanisms are effective
in isolation, they work well together. Input-output examples
are easy to provide, but only capture finitely many inputs.
Similarly, input-output types are a natural way to prevent a
transducer from generating undesired strings and can often be
obtained from function/API specifications. Last, input-output
distances are a natural way to specify how much of the input
string should be preserved by the transformation.

This article is licensed under a Creative Commons Attribution 4.0 International License

We show that if the size of the transducers is fixed, all such 
specifications can be encoded as a set of constraints whose 
solution directly provides a transducer. While the constraints 
for examples are fairly straightforward, to encode types and 
distances, we show that one can use constraints to “guess” 
the simulation relation and the invariants necessary to prove 
that the transducer has the given type and respects the given 
distance constraint. 

Because our constraint-based approach is based on decision
procedures and is modular, it can support more complex
models of transducers: (i) Symbolic Finite Transducers (s-FTs),
which support large alphabets [10], and (ii) FTs with
lookahead, which can express functions that otherwise require
non-determinism. In addition, closure properties of transducers
allow us to reduce repair problems for string transformations
to our synthesis problem.

Contributions: We make the following contributions.


• A constraint-based synthesis algorithm for synthesizing
transducers from complex specifications (Sec. III).

• Extensions of our synthesis algorithm to more complex
models (e.g., symbolic transducers and transducers with
lookahead) and problems (e.g., transducer repair) that
showcase the flexibility of our approach and the power of
working with transducers, which, unlike domain-specific
languages, enjoy strong theoretical properties (Sec. IV).

• ASTRA: a tool that can synthesize and repair transducers
and compares well with a state-of-the-art tool for synthesizing
string transformations (Sec. V).


Proofs and additional results are available at [11]. 


II. TRANSDUCER SYNTHESIS PROBLEM 


In this section, we define the transducer synthesis problem. 

A deterministic finite automaton (DFA) over an alphabet Σ
is a tuple $D = (Q_D, \delta_D, q_D^{init}, F_D)$: $Q_D$ is the set of states,
$\delta_D : Q_D \times \Sigma \to Q_D$ is the transition function, $q_D^{init}$ is the
initial state, and $F_D$ is the set of final states. The extended
transition function $\delta_D^* : Q_D \times \Sigma^* \to Q_D$ is defined as
$\delta_D^*(q, \varepsilon) = q$ and $\delta_D^*(q, au) = \delta_D^*(\delta_D(q, a), u)$. We say that D
accepts a string w if $\delta_D^*(q_D^{init}, w) \in F_D$. The regular language
$\mathcal{L}(D)$ is the set of strings accepted by a DFA D.

A total finite state transducer (FT) is a tuple
$T = (Q_T, \delta_T^{st}, \delta_T^{out}, q_T^{init})$, where $Q_T$ are states and $q_T^{init}$ is the
initial state. Transducers have two transition functions:
$\delta_T^{st} : Q_T \times \Sigma \to Q_T$ defines the target state, while
$\delta_T^{out} : Q_T \times \Sigma \to \Sigma^*$ defines the output string of each transition.
The extended function for states $\delta_T^{st*}$ is defined analogously to the
extended transition function for a DFA. The extended function for
output strings is defined as $\delta_T^{out*}(q, \varepsilon) = \varepsilon$ and
$\delta_T^{out*}(q, au) = \delta_T^{out}(q, a) \cdot \delta_T^{out*}(\delta_T^{st}(q, a), u)$. Given a string w we
use T(w) to denote $\delta_T^{out*}(q_T^{init}, w)$, i.e., the output string generated
by T on w. Given two DFAs P and Q, we write $\{P\}T\{Q\}$ for a
transducer T iff for every string s in $\mathcal{L}(P)$, the output string
T(s) belongs to $\mathcal{L}(Q)$.
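
The two definitions above can be made concrete with a minimal Python sketch. The class names `DFA` and `FT`, the dictionary-based transition tables, and the example transducer are assumptions made for illustration; they are not taken from the paper's tool.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    delta: dict   # (state, char) -> state
    init: str
    finals: set

    def run(self, w):
        """Extended transition function applied to the initial state."""
        q = self.init
        for a in w:
            q = self.delta[(q, a)]
        return q

    def accepts(self, w):
        return self.run(w) in self.finals

@dataclass
class FT:
    st: dict      # (state, char) -> target state
    out: dict     # (state, char) -> output string
    init: str

    def __call__(self, w):
        """T(w): concatenation of the outputs along the run on w."""
        q, pieces = self.init, []
        for a in w:
            pieces.append(self.out[(q, a)])
            q = self.st[(q, a)]
        return "".join(pieces)

# Hypothetical example: T doubles every a and deletes every b.
T = FT(st={("q", "a"): "q", ("q", "b"): "q"},
       out={("q", "a"): "aa", ("q", "b"): ""},
       init="q")
# Q accepts exactly the strings that contain no b.
Q = DFA(delta={("ok", "a"): "ok", ("ok", "b"): "dead",
               ("dead", "a"): "dead", ("dead", "b"): "dead"},
        init="ok", finals={"ok"})
```

With these definitions, `T("aba")` evaluates to `"aaaa"`, and `Q.accepts(T(w))` holds for every input w over {a, b}, which is what the type {P}T{Q} requires when P accepts all strings.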


An edit operation on a string is either an insertion/deletion 
of a character, or a replacement of a character with a different 
one. For example, editing the string ab to the string acb 
requires one edit operation, which is inserting a c after the 
a. The edit distance ed_dist(s,t) between two strings s and
t is the minimum number of edit operations required to reach t from s.
We use len(w) to denote the length of a string w. The mean 
edit distance mean_ed_dist(s,t) between two strings s and t 
is defined as ed_dist(s,t)/len(s). For example, the mean edit 
distance from ab to acb is 1/2 = .5. 
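
As an illustration, the edit distance defined above is the standard Levenshtein distance, which can be computed by dynamic programming. The sketch below is a textbook implementation, not code from the paper; the mean edit distance is undefined for the empty input string, so `mean_ed_dist` assumes a non-empty s.

```python
def ed_dist(s, t):
    """Levenshtein distance: insertions, deletions, replacements cost 1."""
    prev = list(range(len(t) + 1))       # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        cur = [i]                        # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # keep or replace
        prev = cur
    return prev[-1]

def mean_ed_dist(s, t):
    """Edit distance normalized by the input length (s must be non-empty)."""
    return ed_dist(s, t) / len(s)
```

For the example in the text, `ed_dist("ab", "acb")` is 1 and `mean_ed_dist("ab", "acb")` is 0.5.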

We can now formulate the transducer synthesis problem. 
We assume a fixed alphabet $\Sigma$. If the specification requires
that s is translated to t, we write that as $s \mapsto t$.


Problem Statement 1 (Transducer Synthesis). The transducer
synthesis problem has the following inputs and output:

Inputs

• Number of states k and upper bound l on the length of
the output of each transition.

• Set of input-output examples $E = [s \mapsto t]$.

• Input-output types P and Q, given as DFAs.

• A positive upper bound $d \in \mathbb{Q}$ on the mean edit distance.

Output A total transducer $T = (Q_T, \delta_T^{st}, \delta_T^{out}, q_T^{init})$ with k
states such that:

• Every transition of T has an output with length at most
l, i.e., $\forall q \in Q_T, a \in \Sigma.\ len(\delta_T^{out}(q, a)) \le l$.

• T is consistent with the examples: $\forall s \mapsto t \in E.\ T(s) = t$.

• T is consistent with input-output types, i.e., $\{P\}T\{Q\}$.

• For every string $w \in \mathcal{L}(P)$, $mean\_ed\_dist(w, T(w)) \le d$.


The synthesis problem that we present here is for FTs,
and in Section III, we provide a sound algorithm to solve
it using a system of constraints. One of our key contributions 
is that our encoding can be easily adapted to synthesizing 
richer models than FTs (e.g., symbolic transducers [8] and 
transducers with regular lookahead), while still using the same 
encoding building blocks (Section IV). 


III. CONSTRAINT-BASED TRANSDUCER SYNTHESIS 


In this section, we present a way to generate constraints to 
solve the transducer synthesis problem defined in Section II. 
The synthesis problem can then be solved by invoking a 
Satisfiability Modulo Theories (SMT) solver on the constraints. 

We use a constraint encoding, rather than a direct algorith- 
mic approach because of the multiple objectives to be satisfied. 
Synthesizing a transducer that translates a set of input-output 
examples is already an NP-Complete problem [12]. On top of 
that, we also need to handle input-output types and distances. 
Our encoding is divided into three parts, one for each ob- 
jective, which are presented in the following subsections. This 
division makes our encoding very modular and programmable. 
In Section IV we show how it can be adapted to different trans- 
ducer models and problems. We include a brief description of 
the size of the constraint encoding in the extended version. 

The transducer we are synthesizing has k (part of the
problem input) states $Q_T = \{q_0, \ldots, q_{k-1}\}$. We often use $q_T^{init}$
as an alternative for $q_0$, the initial state of T.

We illustrate how our encoding represents a transition
$q_1 \xrightarrow{a/bc} q_2$. The target state is captured using an uninterpreted
function $d^{st} : Q_T \times \Sigma \to Q_T$, e.g., $d^{st}(q_1, a) = q_2$. Representing
the output of the transition is trickier because its length
is not known a priori. The output bound l allows us to limit the
number of characters that may appear in the output. We use an
uninterpreted function $d^{out}_{ch} : Q_T \times \Sigma \times \{0, \ldots, l-1\} \to \Sigma$ to
represent each character in the output string; in our example,
$d^{out}_{ch}(q_1, a, 0) = b$ and $d^{out}_{ch}(q_1, a, 1) = c$. Since an output
string's length can be smaller than l, we use an additional
uninterpreted function $d^{out}_{len} : Q_T \times \Sigma \to \{0, \ldots, l\}$ to
model the length of a transition's output; in our example
$d^{out}_{len}(q_1, a) = 2$. We say that an assignment to the above functions
extends to a transducer T, where T is obtained by
instantiating $\delta_T^{st}$ and $\delta_T^{out}$ as described above.
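
To make the representation concrete, the following Python sketch stores the three function tables as dictionaries and reads the output string of a transition back from them. The dictionary layout, the value chosen for the bound l, and the helper name `transition_output` are assumptions for illustration; in the actual encoding these are uninterpreted functions whose values are chosen by the solver.

```python
L_BOUND = 3  # output bound l (assumed value for this example)

# Tables for the single transition q1 --a/bc--> q2.
d_st = {("q1", "a"): "q2"}
d_len = {("q1", "a"): 2}
# Positions z >= d_len are unconstrained padding: any character may appear.
d_ch = {("q1", "a", 0): "b", ("q1", "a", 1): "c", ("q1", "a", 2): "a"}

def transition_output(d_ch, d_len, q, a):
    """Output string of transition (q, a): the first d_len characters."""
    return "".join(d_ch[(q, a, z)] for z in range(d_len[(q, a)]))
```

Here `transition_output(d_ch, d_len, "q1", "a")` yields `"bc"`; the padding character at position 2 is ignored because the recorded length is 2.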


A. Input-output Examples 


Goal: For each input-output example $s \mapsto t \in E$, T should
translate s to t.


Translating s to the correct output string means that
$\delta_T^{out*}(q_T^{init}, s) = t$. Generating constraints that capture this
behavior of T on an example is challenging because we do not
know a priori what parts of t are produced by what steps of the
transducer's run. Suppose that we need to translate $s = a_0 a_1$ to
$t = b_0 b_1 b_2$. A possible solution is for the transducer to have
the run $q_0 \xrightarrow{a_0/b_0} q_1 \xrightarrow{a_1/b_1 b_2} q_2$. Another possible solution
might be to instead have $q_0 \xrightarrow{a_0/b_0 b_1} q_1 \xrightarrow{a_1/b_2} q_2$. Notice that
the two runs traverse the same states but produce different
parts of the output strings at each step. Intuitively, we need a
way to "track" how much output the transducer has produced
before processing the i-th character in the input and what state
it has landed in. For every input example $s \mapsto t$ such that
$s = a_0 \cdots a_{n-1}$ and $t = b_0 \cdots b_{m-1}$, we introduce an uninterpreted
function $config_s : \{0, \ldots, n\} \to \{0, \ldots, m\} \times Q_T$ such
that $config_s(i) = (j, q_T)$ iff after reading $a_0 \cdots a_{i-1}$,
the transducer T has produced the output $b_0 \cdots b_{j-1}$ and
reached state $q_T$, i.e., $\delta_T^{out*}(q_0, a_0 \cdots a_{i-1}) = b_0 \cdots b_{j-1}$ and
$\delta_T^{st*}(q_0, a_0 \cdots a_{i-1}) = q_T$.

We now give the constraints that describe the behavior of
$config_s$. Constraint 1 states that a configuration must start
at the initial state and be at position 0 in the output.

$$config_s(0) = (0, q_T^{init}) \qquad (1)$$


Constraint 2 captures how the configuration is updated when
reading the i-th character of the input. For every $0 \le i < n$,
$0 \le j \le m$, $c \in \Sigma$, and $q_T \in Q_T$:

$$\begin{aligned}
& config_s(i) = (j, q_T) \wedge a_i = c \Rightarrow \\
& \quad \Big[ \bigwedge_{0 \le z < l} \big( d^{out}_{ch}(q_T, c, z) = b_{j+z} \vee z \ge d^{out}_{len}(q_T, c) \big) \wedge \\
& \quad\ \ config_s(i+1) = \big( j + d^{out}_{len}(q_T, c),\ d^{st}(q_T, c) \big) \Big]
\end{aligned} \qquad (2)$$

Informally, if the i-th character is c and the transducer has
reached state $q_T$ and produced the characters $b_0 \cdots b_{j-1}$ so
far, the transition reading c from state $q_T$ outputs characters
$b_j \cdots b_{j+f-1}$, where f is the output length of the transition.
The next configuration is then $(j + f, d^{st}(q_T, c))$.

Finally, Constraint 3 forces T to be completely done with
generating t when s has been entirely read. Recall that
len(s) = n and len(t) = m.

$$\bigvee_{q_T \in Q_T} config_s(n) = (m, q_T) \qquad (3)$$

The encoding for examples is sound and complete [11].
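
The following Python sketch checks Constraints 1 to 3 for a given assignment to the function tables, by computing the unique sequence of configurations that the tables induce on an example. It is a checker for a candidate solution, not the SMT encoding itself; the helper names `tables` and `check_example` and the concrete characters are hypothetical.

```python
def tables(transitions):
    """Build (d_st, d_ch, d_len) from {(q, c): (target, output)}."""
    d_st, d_ch, d_len = {}, {}, {}
    for (q, c), (target, out) in transitions.items():
        d_st[(q, c)] = target
        d_len[(q, c)] = len(out)
        for z, ch in enumerate(out):
            d_ch[(q, c, z)] = ch
    return d_st, d_ch, d_len

def check_example(d_st, d_ch, d_len, init, s, t):
    """Return [config_s(0), ..., config_s(n)] if s is translated to t,
    and None if some constraint is violated."""
    j, q = 0, init                     # Constraint 1: config_s(0) = (0, init)
    configs = [(j, q)]
    for c in s:
        f = d_len[(q, c)]
        for z in range(f):             # Constraint 2: emit exactly t[j:j+f]
            if j + z >= len(t) or d_ch[(q, c, z)] != t[j + z]:
                return None
        j, q = j + f, d_st[(q, c)]
        configs.append((j, q))
    return configs if j == len(t) else None   # Constraint 3

# The two runs from the text, with s = "ab" and t = "xyz" standing in
# for a_0 a_1 and b_0 b_1 b_2.
run1 = {("q0", "a"): ("q1", "x"), ("q1", "b"): ("q2", "yz")}
run2 = {("q0", "a"): ("q1", "xy"), ("q1", "b"): ("q2", "z")}
```

Both runs satisfy the example but induce different configurations: `check_example(*tables(run1), "q0", "ab", "xyz")` gives `[(0, "q0"), (1, "q1"), (3, "q2")]`, while `run2` gives `[(0, "q0"), (2, "q1"), (3, "q2")]`.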


B. Input-Output Types


Goal: T should satisfy the property $\{P\}T\{Q\}$.


Encoding this property using constraints is challenging
because it requires enforcing that when T reads one of the
(potentially) infinitely many strings in P it always outputs
a string in Q. To solve this problem, we draw inspiration
from how one proves that the property $\{P\}T\{Q\}$ holds,
i.e., using a simulation relation that relates runs over P,
T and Q. Intuitively, if P has read some string w, we
need to be able to encode the behavior of T in terms of
w, i.e., what state of T this transducer is in after reading
w and what output string w' it produced. Further, we also
need to be able to encode in which state Q would be after
reading the output string w'. We do this by introducing a
function $sim : Q_P \times Q_T \times Q_Q \to \{0, 1\}$, which preserves
the following invariant: $sim(q_P, q_T, q_Q)$ holds if there exist
strings w, w' such that $\delta_P^*(q_P^{init}, w) = q_P$, $\delta_T^{st*}(q_T^{init}, w) = q_T$,
$\delta_T^{out*}(q_T^{init}, w) = w'$, and $\delta_Q^*(q_Q^{init}, w') = q_Q$.

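Constraint 4 states the initial condition of the simulation,
i.e., P, T, and Q are in their initial states.

$$sim(q_P^{init}, q_T^{init}, q_Q^{init}) \qquad (4)$$

Constraint 5 encodes how we advance the simulation relation
for states $q_P, q_T, q_Q$ and for a character $c \in \Sigma$, using free
variables $c_0, \ldots, c_{l-1}$ and $q_Q^0, \ldots, q_Q^l$ that are separate for each
combination of $q_P, q_T, q_Q$, and c:

$$\begin{aligned}
& sim(q_P, q_T, q_Q) \Rightarrow \bigwedge_{0 \le z \le l} \Big( d^{out}_{len}(q_T, c) = z \Rightarrow \\
& \quad \Big[ \bigwedge_{0 \le x < z} d^{out}_{ch}(q_T, c, x) = c_x \Big] \wedge
  \Big[ q_Q^0 = q_Q \wedge \bigwedge_{1 \le x \le z} q_Q^x = d_Q(q_Q^{x-1}, c_{x-1}) \Big] \wedge \\
& \quad sim\big( \delta_P(q_P, c),\ d^{st}(q_T, c),\ q_Q^z \big) \Big)
\end{aligned} \qquad (5)$$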

Intuitively, if $sim(q_P, q_T, q_Q)$ and we read a character c,
P moves to $\delta_P(q_P, c)$ and T moves to $d^{st}(q_T, c)$. However,
we also need to advance Q on the $d^{out}_{len}(q_T, c)$ symbols produced
by $d^{out}_{ch}$. We hard-code the transition relation $\delta_Q$ in an
uninterpreted function $d_Q : Q_Q \times \Sigma \to Q_Q$, and apply it to
compute the output state reached when reading the output
string. E.g., if $d^{out}_{len}(q_T, c) = 2$ and $d^{out}_{ch}(q_T, c, 0) = c_0$ and
$d^{out}_{ch}(q_T, c, 1) = c_1$, the next state in Q is $d_Q(d_Q(q_Q, c_0), c_1)$.

Lastly, Constraint 6 states that if we encounter a string in
$\mathcal{L}(P)$, i.e., P is in a state $q_P \in F_P$, the relation does not
contain a state $q_Q \notin F_Q$. Since Q is deterministic, this means
that Q accepts T's output.

$$\bigwedge_{q_P \in F_P} \bigwedge_{q_T \in Q_T} \bigwedge_{q_Q \notin F_Q} \neg sim(q_P, q_T, q_Q) \qquad (6)$$

The constraint encoding for types is sound and complete [11].
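
For a fixed, already known transducer, the smallest relation satisfying Constraints 4 and 5 is the set of reachable triples of the product of P, T, and Q, so the type can be checked by a simple graph search. The Python sketch below does this; it illustrates the simulation argument rather than the synthesis encoding, and the tuple layout and all example automata are assumptions.

```python
from collections import deque

def run_dfa(delta, q, w):
    """State reached from q after reading the string w."""
    for a in w:
        q = delta[(q, a)]
    return q

def satisfies_type(P, T, Q, alphabet):
    """Check {P}T{Q} by exploring reachable triples (q_P, q_T, q_Q)."""
    (dP, iP, fP), (st, out, iT), (dQ, iQ, fQ) = P, T, Q
    start = (iP, iT, iQ)                      # Constraint 4
    sim, todo = {start}, deque([start])
    while todo:
        qP, qT, qQ = todo.popleft()
        if qP in fP and qQ not in fQ:         # violates Constraint 6
            return False
        for c in alphabet:                    # Constraint 5: P and T take one
            nxt = (dP[(qP, c)], st[(qT, c)],  # step, Q reads T's whole output
                   run_dfa(dQ, qQ, out[(qT, c)]))
            if nxt not in sim:
                sim.add(nxt)
                todo.append(nxt)
    return True

# Hypothetical automata over {a, b}.
no_b = {("ok", "a"): "ok", ("ok", "b"): "dead",
        ("dead", "a"): "dead", ("dead", "b"): "dead"}
P_all = ({("p", "a"): "p", ("p", "b"): "p"}, "p", {"p"})   # all strings
P_only_a = (no_b, "ok", {"ok"})                            # strings without b
Q_only_a = (no_b, "ok", {"ok"})
loop = {("q", "a"): "q", ("q", "b"): "q"}
T_double = (loop, {("q", "a"): "aa", ("q", "b"): ""}, "q") # a -> aa, b deleted
T_id = (loop, {("q", "a"): "a", ("q", "b"): "b"}, "q")     # identity
```

On these inputs, `T_double` satisfies the type for every input string, whereas the identity transducer satisfies it only when the input type already excludes b.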


C. Input-output Distance 


Goal: The mean edit distance between any input string w
in $\mathcal{L}(P)$ and the output string T(w) should not exceed d.


Capturing the edit distance for all the possible inputs in the 
language of P and the corresponding outputs produced by the 
transducer is challenging because these sets can be infinite. 
Furthermore, exactly computing the edit distance between an 
input and an output string may involve comparing characters 
appearing on different transitions in the transducer run. For 
example, consider the transducer shown in Figure 2a and
suppose that we are only interested in strings in the input type
$P = a(ba)^*a$. The first transition from $q_0$ deletes the a,
therefore making 1 edit. This transducer has a cycle between
states $q_1$ and $q_2$, which can be taken any number of times.
Each iteration, locally, would require that we make 2 edits:
one to change the b to a, and the other to change the a to
b. However, the total number of edits made over any string in
the input type $P = a(ba)^*a$ by this transducer is 1, because
the transducer changes strings of the form $a(ba)^n a$ to be of
the form $(ab)^n a$. Looking at the transitions in isolation, we
are prevented from deducing that the edit distance is always 
1 because the first transition delays outputting a character. If 
there was no such delay, as is the case for the transducer in 
Figure 2b, which is equivalent on the relevant input type to 
the one in Figure 2a, then this issue would not arise. 

We take inspiration from Benedikt et al. [13] and focus
on the simpler problem of synthesizing a transducer that
has 'aggregate cost' that satisfies the given objective.¹ For
a transducer T and string $s = a_0 \ldots a_n$, let
$q_T^{init} \xrightarrow{a_0/y_0} q_T^1 \cdots q_T^n \xrightarrow{a_n/y_n} q_T^{n+1}$ be the run of s on T. Then, the
aggregate cost of T on s is the sum of the edit distances
$ed\_dist(a_i, y_i)$ over all indices $0 \le i \le n$. The mean aggregate
cost of T on s is the aggregate cost divided by len(s), the
length of s. Because the edit distance between s and T(s) is at
most the sum of the per-transition edit distances, it follows
that if T has a mean aggregate cost lower than some specified d
for every string, then it also has a mean edit distance lower
than d for every string.

However, the mean aggregate cost overapproximates the edit 
distance, e.g., the transducer in Figure 2a has mean aggregate 
cost 1, while the mean edit distance when considering only 
strings in $P = a(ba)^*a$ is less than 1/2. For this reason, if
the mean edit distance objective was set to 1/2, our constraint 
encoding can only synthesize the transducer in Figure 2b, and 
not the equivalent one in Figure 2a. 
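
The gap between aggregate cost and edit distance can be reproduced with a short Python sketch. The transition table `FIG_2A` is an assumed reading of the transducer in Figure 2a, restricted to the transitions used on inputs of the form a(ba)^n a; the state name `q3` is hypothetical.

```python
def ed_dist(s, t):
    """Levenshtein distance between s and t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

# (state, input char) -> (target state, output string)
FIG_2A = {("q0", "a"): ("q1", ""),    # delete the leading a (delayed output)
          ("q1", "b"): ("q2", "a"),   # b becomes a
          ("q2", "a"): ("q1", "b"),   # a becomes b
          ("q1", "a"): ("q3", "a")}   # copy the final a

def run(trans, init, w):
    """Return (output, aggregate cost) of the run of the transducer on w."""
    q, pieces, cost = init, [], 0
    for c in w:
        q, y = trans[(q, c)]
        pieces.append(y)
        cost += ed_dist(c, y)         # edit distance of this transition only
    return "".join(pieces), cost
```

For w = a(ba)^2 a = "ababaa" the run outputs "ababa" with aggregate cost 5, although the edit distance between input and output is 1.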


¹ Benedikt et al. [13] studied a variant of the problem where the distance
is bounded by some finite constant. Their work shows that when there is a 
transducer between two languages that has some bounded global edit distance, 
then there is also a transducer that is bounded (but with a different bound) 
under a local method of computing the edit distance—i.e., one where the 
computation of the edit distance is done transition by transition. 


Fig. 2: Transducers with and without delay. (a) Transducer with delayed output. (b) Transducer without delay.


Our encoding is complete for transducers in which the
aggregate cost coincides with the actual edit distance. We
leave the problem of being complete with regard to global
edit distance as an open problem. In fact, we are not even
aware of an algorithm for checking (instead of synthesizing)
whether a transducer satisfies a mean edit distance objective.²
In Section IV-B, we present transducers with lookahead, which 
can mitigate this source of incompleteness. Furthermore, our 
evaluation shows that using the aggregate cost and enabling 
lookahead are both effective techniques in practice. 

We can now present our constraints. First, we provide 
constraints for the edit distance of individual transitions (recall 
that transitions are being synthesized and we therefore need to 
compute their edit distances separately). Secondly, we provide 
constraints that implicitly compute state invariants to capture 
the aggregate cost between input and output strings at various 
points in the computation. We are given a rational number d as 
an input to the problem, which is the allowed distance bound. 


Edit Distance of Individual Transitions. To compute the edit
distance between the input and the output of each transition,
we introduce a function $ed : Q_T \times \Sigma \to \mathbb{Z}$. For a transition
from state $q_T$ reading a character c, $ed(q_T, c)$ represents
the edit distance between c and $\delta_T^{out}(q_T, c)$. Notice that this
quantity is bounded by the output bound l. The constraints to
encode the value of this function are divided into two cases:
i) the output of the transition contains the input character c
(Constraint 7), ii) the output of the transition does not contain
the input character c (Constraint 8). In both cases, the values
are set via a simple case analysis on whether the length of
the output is 0 (edit distance is 1) or not (the edit distance is
related to the length of the output).


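$$\begin{aligned}
& \Big[ \bigvee_{0 \le z < d^{out}_{len}(q_T, c)} d^{out}_{ch}(q_T, c, z) = c \Big] \Rightarrow \\
& \quad \Big[ \big( d^{out}_{len}(q_T, c) = 0 \Rightarrow ed(q_T, c) = 1 \big) \wedge
  \big( d^{out}_{len}(q_T, c) \neq 0 \Rightarrow ed(q_T, c) = d^{out}_{len}(q_T, c) - 1 \big) \Big]
\end{aligned} \qquad (7)$$

$$\begin{aligned}
& \Big[ \bigwedge_{0 \le z < d^{out}_{len}(q_T, c)} d^{out}_{ch}(q_T, c, z) \neq c \Big] \Rightarrow \\
& \quad \Big[ \big( d^{out}_{len}(q_T, c) = 0 \Rightarrow ed(q_T, c) = 1 \big) \wedge
  \big( d^{out}_{len}(q_T, c) \neq 0 \Rightarrow ed(q_T, c) = d^{out}_{len}(q_T, c) \big) \Big]
\end{aligned} \qquad (8)$$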
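
The case analysis behind Constraints 7 and 8 can be checked against a general edit distance routine. The Python sketch below is illustrative only: `ed_transition` mirrors the case split (empty output, output containing c, output not containing c), and `agrees_with_levenshtein` is a hypothetical helper that compares it with the standard Levenshtein distance on all short outputs.

```python
from itertools import product

def ed_dist(s, t):
    """Levenshtein distance between s and t."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        cur = [i]
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]

def ed_transition(c, out):
    """Edit distance between the single character c and the string out."""
    if len(out) == 0:
        return 1                 # delete c
    if c in out:
        return len(out) - 1      # keep c, insert the other characters
    return len(out)              # replace c, insert the other characters

def agrees_with_levenshtein(alphabet, max_len):
    """Compare the case split with ed_dist on all outputs up to max_len."""
    return all(ed_transition(c, "".join(y)) == ed_dist(c, "".join(y))
               for c in alphabet
               for n in range(max_len + 1)
               for y in product(alphabet, repeat=n))
```

Calling `agrees_with_levenshtein("abc", 3)` confirms that the closed form matches the dynamic-programming result on every output of length at most 3 over a three-letter alphabet.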


² The mean edit distance is similar to mean payoff [14], which discounts
a cost by the length of a string and looks at the behavior of a transducer in 
the limit. Our distance is different because 1) it looks at finite-length strings, 
and 2) it requires computing the edit distance, which cannot be done one 
transition at a time. 
Edit Distance of Arbitrary Strings. Suppose that T has two
consecutive transitions that read the character a, one from $q_0$
to $q_1$ and one from $q_1$ to $q_2$, and that the specified mean edit
distance is d = 0.5. The edit distance is 0 for the first transition
and 2 for the second one. For the input string aa, the mean
aggregate cost is 2/2 = 1, which exceeds d, so the specification is
not satisfied. In general, we cannot keep track of every input
string in the input type and look at its length and the number
of edits that were made over it. So, how can we compute
the mean aggregate cost over any input string? The first part
of our solution is to scale the edit distance over a single
transition depending on the specified mean edit distance. This
operation makes it such that an input string is under the edit
distance bound if the sum of the weighted edit distances of
its transitions is $\ge 0$. The invariant we need to maintain is
that the sum of the weights at any stage of the run gives us
where we are with regard to the mean aggregate cost. For each
transition we compute the difference between the edit distance
over the transition and the specified mean edit distance d. We
introduce the uninterpreted function $wed : Q_T \times \Sigma \to \mathbb{Q}$,
which stands for weighted edit distance. For a transition at
$q_T$ reading a character c, the weighted edit distance is given
by $wed(q_T, c) = d - ed(q_T, c)$. The sum of the weights of
all transitions tells us the cumulative difference. Going back to
our example, the weighted edit distances of the two transitions
are $wed(q_0, a) = 0.5$ and $wed(q_1, a) = -1.5$, making the
cumulative distance $-1$ and implying that the specification is
violated. We can now compute the mean edit distance over
a run without keeping track of the length of the run and the
number of edits performed over it.
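
The weighting trick rests on a simple identity: the mean of the per-transition edit distances is at most d exactly when the sum of the weights d - ed is non-negative. A minimal Python sketch with exact rational arithmetic, using hypothetical helper names, is:

```python
from fractions import Fraction

def weighted(edits, d):
    """Weighted edit distance wed = d - ed for each transition of a run."""
    return [d - e for e in edits]

def within_bound(edits, d):
    """True iff the mean of `edits` is at most d (no division needed)."""
    return sum(weighted(edits, d)) >= 0
```

For the running example with d = 1/2 and per-transition edit distances 0 and 2, the weights are 1/2 and -3/2, their sum is -1, and `within_bound([0, 2], Fraction(1, 2))` is `False`.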


We still need to compute the weighted edit distance for
every string in the possibly infinite language $\mathcal{L}(P)$. Building
on the idea of simulation from the previous section, we
introduce a new function called $en : Q_P \times Q_T \times Q_Q \to \mathbb{Q}$,
which tracks an upper bound on the sum of the distances so
far at that point in the simulation. This function is similar
to a progress measure, which is a type of invariant used
to solve energy games [15], a connection we expand on in
Section VI. In particular, we already know that if there exist
strings w, w' such that $\delta_P^*(q_P^{init}, w) = q_P$, $\delta_T^{st*}(q_T^{init}, w) = q_T$,
$\delta_T^{out*}(q_T^{init}, w) = w'$, and $\delta_Q^*(q_Q^{init}, w') = q_Q$, then
we have $sim(q_P, q_T, q_Q)$. Let this run over T be denoted
$q_T^0 \xrightarrow{a_0/y_0} q_T^1 \cdots q_T^{n-1} \xrightarrow{a_{n-1}/y_{n-1}} q_T^n$, where $q_T^0 = q_T^{init}$,
$w = a_0 \cdots a_{n-1}$, $w' = y_0 \cdots y_{n-1}$, and $q_T = q_T^n$. We have that
$en(q_P, q_T, q_Q) \le \sum_{i=0}^{n-1} wed(q_T^i, a_i)$.

The en function is a budget on the number of edits we
can still perform. At the initial states, we start with no 'initial
credit' and the energy is 0.

$$en(q_P^{init}, q_T^{init}, q_Q^{init}) = 0 \qquad (9)$$


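Constraint 10 bounds the energy budget according to the
weighted edit distance of a transition by computing the
minimum budget required at any point to still satisfy the distance
bound. For each combination of $q_P, q_T, q_Q$, and $c \in \Sigma$, the
constraint uses free variables $c_0, \ldots, c_{l-1}$ and $q_Q^0, \ldots, q_Q^l$:

$$\begin{aligned}
& \bigwedge_{0 \le z \le l} \Big( d^{out}_{len}(q_T, c) = z \Rightarrow \\
& \quad \Big[ \bigwedge_{0 \le x < z} d^{out}_{ch}(q_T, c, x) = c_x \Big] \wedge
  \Big[ q_Q^0 = q_Q \wedge \bigwedge_{1 \le x \le z} q_Q^x = d_Q(q_Q^{x-1}, c_{x-1}) \Big] \wedge \\
& \quad en(q_P, q_T, q_Q) \ge en\big( \delta_P(q_P, c),\ d^{st}(q_T, c),\ q_Q^z \big) - wed(q_T, c) \Big)
\end{aligned} \qquad (10)$$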


In our example, Constraint 10 encodes that the energy at
$q_0$ can be 1 less than that at $q_1$, but that the energy at $q_1$
needs to be 3 greater than at $q_2$ since we need to spend 3 edit
operations over the second transition.

At any point during a run, the transducer is allowed to go 
below the mean edit distance and then ‘catch up’ later because 
we only care about the edit distance when the transducer has 
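finished reading a string in $\mathcal{L}(P)$. Therefore, when we reach a
final state of P, the transducer should not be in 'energy debt'.

$$\bigwedge_{q_P \in F_P} \bigwedge_{q_T \in Q_T} \bigwedge_{q_Q \in Q_Q} sim(q_P, q_T, q_Q) \Rightarrow en(q_P, q_T, q_Q) \ge 0 \qquad (11)$$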


The encoding presented in this section is sound [11]. 
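
As a rough illustration of Constraints 9 to 11, the Python sketch below checks a candidate energy assignment for a fixed transducer. It is a simplification under stated assumptions: it only visits triples reachable in the product of P, T, and Q (the encoding states Constraint 10 for all combinations), it computes per-transition edit distances directly instead of through the function ed, and the transducer, the automata, and the assignment `en` are hypothetical.

```python
from fractions import Fraction

def run_dfa(delta, q, w):
    for a in w:
        q = delta[(q, a)]
    return q

def ed_transition(c, out):
    """Edit distance between one input character and a transition output."""
    if len(out) == 0:
        return 1
    return len(out) - 1 if c in out else len(out)

def check_energy(P, T, Q, alphabet, d, en):
    """Check Constraints 9-11 for `en`, a dict from triples to rationals."""
    (dP, iP, fP), (st, out, iT), (dQ, iQ) = P, T, Q
    start = (iP, iT, iQ)
    if en[start] != 0:                              # Constraint 9
        return False
    seen, todo = {start}, [start]
    while todo:
        cur = todo.pop()
        qP, qT, qQ = cur
        if qP in fP and en[cur] < 0:                # Constraint 11
            return False
        for c in alphabet:
            y = out[(qT, c)]
            nxt = (dP[(qP, c)], st[(qT, c)], run_dfa(dQ, qQ, y))
            wed = d - ed_transition(c, y)
            if en[cur] < en[nxt] - wed:             # Constraint 10
                return False
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return True

d = Fraction(1, 2)
# Hypothetical transducer over the input alphabet {a}: a/a, then a/bc, then a/a.
T = ({("q0", "a"): "q1", ("q1", "a"): "q2", ("q2", "a"): "q2"},
     {("q0", "a"): "a", ("q1", "a"): "bc", ("q2", "a"): "a"}, "q0")
Q = ({("r", ch): "r" for ch in "abc"}, "r")         # output type: everything
dP = {(f"p{i}", "a"): f"p{min(i + 1, 4)}" for i in range(5)}
P_len4 = (dP, "p0", {"p4"})                         # inputs of length >= 4
P_len2 = (dP, "p0", {"p2", "p3", "p4"})             # inputs of length >= 2
en = {("p0", "q0", "r"): Fraction(0), ("p1", "q1", "r"): Fraction(1, 2),
      ("p2", "q2", "r"): Fraction(-1), ("p3", "q2", "r"): Fraction(-1, 2),
      ("p4", "q2", "r"): Fraction(0)}
```

With the input type restricted to strings of length at least 4, the candidate `en` passes all three checks, because the two extra copied characters repay the debt of -1 incurred by the second transition. If strings of length 2 are also allowed, the check fails at the triple where P is final and the energy is -1.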


IV. RICHER MODELS AND SPECIFICATIONS 


We extend our technique to more expressive models (Sec- 
tions IV-A and IV-B) and show how our synthesis approach 
can be used not only to synthesize transducers, but also 
to repair them (Section IV-C). In the extended version, we 
describe an encoding of an alternative distance measure [11]. 


A. Symbolic Transducers 


Symbolic finite automata (s-FA) and transducers (s-FT) ex- 
tend their non-symbolic counterparts by allowing transitions to 
carry predicates and functions to represent (potentially infinite) 
sets of input characters and output strings. Figure 3a shows an 
s-FT that extends the escapeQuotes transducer from Figure 1a
to handle alphabetic characters. The bottom transition from
$q_0$ reads a character " (bound to the variable x) and outputs
the string \" (i.e., a \ followed by the character stored in x).
Symbolic finite automata (s-FA) are s-FTs with no outputs. To
simplify our exposition, we focus on s-FAs and s-FTs that only
operate over ASCII characters that are ordered by their codes.
In particular, all of our predicates are unions of intervals over
characters (i.e., x ≠ \ is really the union of intervals [NUL-[]
and []-DEL]); we often use the predicate notation instead
of explicitly writing the intervals for ease of presentation.
Furthermore, we only consider two types of output functions: 
constant characters and offset functions of the form x + k that 
output the character obtained by taking the input x and adding 
a constant k to it—e.g., applying x + (—32) to a lowercase 
alphabetic letter gives the corresponding uppercase letter. 

In the rest of the section, we show how we can solve the 
transducer synthesis problem in the case where P and Q are 
s-FAs and the goal is to synthesize an s-FT (instead of an 
FT) that meets the given specification. Intuitively, we do this 
by ‘finitizing’ the alphabet of the now symbolic input-output 
types, synthesizing a finite transducer over this alphabet using 

Fig. 3: Example of Finitization. (a) escapeQuotes s-FT. (b) F(escapeQuotes). (c) Set of minterms and their witness elements: the minterms are [x ≠ " ∧ x ≠ \], [x = "], and [x = \], with witness characters wit([x ≠ " ∧ x ≠ \]) = a, wit([x = "]) = ", and wit([x = \]) = \.


the technique presented in Section III, and then extracting an 
s-FT from the solution. 


Finitizing the Alphabet. The idea of finitizing the alphabet
of s-FAs is a known one [8] and is based on the concept
of minterms, the maximal satisfiable Boolean combinations
of the predicates appearing in the s-FAs.
For an s-FA M, we can define its set of predicates as:
$Predicates(M) = \{\varphi \mid q \xrightarrow{\varphi} q' \in \delta_M\}$. The set of minterms
mterms(M) is the set of satisfiable Boolean combinations of
all the predicates in Predicates(M). For example, for the set
of predicates over the s-FT escapeQuotes in Figure 3a, we have
that mterms(escapeQuotes) = {x ≠ " ∧ x ≠ \, x = ", x = \}.
The reader can learn more about minterms in [8]. We
assign each minterm a representative character, as indicated
in Figure 3c, and then construct a finite automaton from the
resulting finite alphabet $\Sigma$. For a character $c \in \Sigma$, we refer
to its corresponding minterm by mt(c). In the other direction,
for each minterm $\psi \in mterms(M)$, we refer to its uniquely
determined representative character by $wit(\psi)$.

For an s-FA M, we denote its corresponding FA over the
alphabet mterms(M) with F(M). Given an s-FA M, the set
of transitions of F(M) is defined as follows:

$$\delta_{F(M)} = \{\, q \xrightarrow{wit(\psi)} q' \mid q \xrightarrow{\varphi} q' \in \delta_M \wedge \psi \in mterms(M) \wedge IsSat(\psi \wedge \varphi) \,\}$$

This algorithm replaces a transition guarded by a predicate $\varphi$
in the given s-FA with a set of transitions consisting of the
witnesses of the minterms where $\varphi$ is satisfiable. In interval
arithmetic this is the set of intervals that intersect with the
interval specified by $\varphi$. The transition from $q_1$ guarded by the
predicate [x ≠ \] in Figure 3a intersects with 2 minterms,
[x ≠ " ∧ x ≠ \] and [x = "]. As a result, we see that this
transition is replaced by two transitions in Figure 3b, one that
reads " and another that reads a.
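
The finitization step can be sketched in Python by representing predicates as Boolean functions over the ASCII range and grouping characters by the set of predicates they satisfy. This is a simple enumeration rather than the interval-based computation described in the text; the witness choice (the character a when available, otherwise the smallest character) and the target state of the example transition are assumptions made for illustration.

```python
from collections import defaultdict

ASCII = [chr(i) for i in range(128)]

def minterms(predicates):
    """Group characters by which predicates they satisfy; each non-empty
    group is one minterm, returned as a frozenset of characters."""
    groups = defaultdict(set)
    for ch in ASCII:
        signature = tuple(p(ch) for p in predicates)
        groups[signature].add(ch)
    return [frozenset(g) for g in groups.values()]

def finitize(transitions, predicates):
    """Replace each guard by the witnesses of the minterms it intersects.
    `transitions` is a list of (source, guard, target) triples."""
    mts = minterms(predicates)
    witness = {mt: ("a" if "a" in mt else min(mt)) for mt in mts}
    return {(q, witness[mt], q2)
            for (q, guard, q2) in transitions
            for mt in mts
            if any(guard(ch) for ch in mt)}

# Predicates mentioned in the text for escapeQuotes.
other = lambda x: x != '"' and x != "\\"
is_quote = lambda x: x == '"'
is_backslash = lambda x: x == "\\"
not_backslash = lambda x: x != "\\"
PREDICATES = [other, is_quote, is_backslash, not_backslash]
```

On these predicates there are three minterms, and `finitize([("q1", not_backslash, "q0")], PREDICATES)` produces two transitions out of `q1`, one reading a and one reading ", matching the description above.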


From FTs to s-FTs. Once we have synthesized an FT T,
we need to extract an s-FT from it. There are many s-FTs
equivalent to a given FT and here we present one way of doing
this conversion which is used in our implementation. Let the
size of an interval I (the number of characters it contains) be
given by size(I), and the offset between 2 intervals $I_1$ and
$I_2$ (i.e., the difference between the least elements of $I_1$ and
$I_2$) be given by $offset(I_1, I_2)$. Suppose we have a transition
$q \xrightarrow{c / y_0 \cdots y_n} q'$, where $c, y_i \in \Sigma$. Then, we construct a transition
$q \xrightarrow{mt(c) / f_0 \cdots f_n} q'$, where for each $y_i$, the corresponding
function $f_i$ is determined by the following rules (x always
indicates the variable bound to the input predicate):

1) If $c = y_i$, then $f_i = (x)$, i.e., the identity function.

2) If mt(c) and mt($y_i$) consist of single intervals $I_1$ and $I_2$,
respectively, such that $size(I_1) = size(I_2)$, then
$f_i = (x + offset(I_1, I_2))$. For instance, if the input interval is
[a-z] and the output interval is [A-Z], then the output
function is (x + (-32)), which maps lowercase letters to
uppercase ones.

3) Otherwise $f_i = y_i$, i.e., the output is a character in the
output minterm.

While our s-FT recovery algorithm is sound, it may apply 
case 3 more often than necessary and introduce many con- 
stants, therefore yielding a transducer that does not generalize 
well to unseen examples. Our evaluation shows that our 
technique works well in practice. The proof of soundness of
this algorithm is in the extended version [11].
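
The three recovery rules can be sketched in Python as follows. The representation of a minterm as a list of (lo, hi) code-point intervals, the tagged tuples used for output functions, and the example minterm function `mt` are assumptions for illustration and are not taken from the implementation.

```python
def output_function(c, y, mt):
    """Pick the symbolic output for concrete output y on input witness c.
    `mt(ch)` returns the minterm of ch as a list of (lo, hi) intervals."""
    if c == y:
        return ("identity",)                    # rule 1: f = x
    mc, my = mt(c), mt(y)
    if len(mc) == 1 and len(my) == 1:
        (lo1, hi1), (lo2, hi2) = mc[0], my[0]
        if hi1 - lo1 == hi2 - lo2:              # intervals of equal size
            return ("offset", lo2 - lo1)        # rule 2: f = x + offset
    return ("const", y)                         # rule 3: f = y

def apply(f, x):
    """Evaluate an output function on the input character x."""
    if f[0] == "identity":
        return x
    if f[0] == "offset":
        return chr(ord(x) + f[1])
    return f[1]

def mt(ch):
    """Hypothetical minterms: lowercase, uppercase, and everything else."""
    if "a" <= ch <= "z":
        return [(97, 122)]
    if "A" <= ch <= "Z":
        return [(65, 90)]
    return [(0, 64), (91, 96), (123, 127)]
```

For a concrete transition that reads a and outputs A, rule 2 yields the offset -32, so the recovered function maps every lowercase letter to its uppercase counterpart, for example q to Q.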


B. Synthesizing Transducers with Lookahead 


Deterministic transducers cannot express functions where 
the output at a certain transition depends on future characters 
in the input. Consider the problem of extracting all substrings
of the form <x> (where x ≠ <) from an input string. This
is the getTags problem from [16]. A deterministic transducer
cannot express this transformation because when it reads < 
followed by x it has to output <x if the next character is a > 
and nothing otherwise. However, the transducer does not have 
access to the next character! 

Instead, we extend our technique to handle deterministic
transducers with lookahead, i.e., the ability to look at the string
suffix when reading a symbol. Formally, a Transducer with
Regular Lookahead is a pair (T, R) where T is an FT with
$\Sigma_T = Q_R \times \Sigma$, and R is a total DFA with $\Sigma_R = \Sigma$. The
transducer T now has another input in its transition function,
although it still only outputs characters from $\Sigma$, i.e.,
$\delta_T^{out} : Q_T \times (Q_R \times \Sigma) \to \Sigma^*$, and $\delta_T^{st} : Q_T \times (Q_R \times \Sigma) \to Q_T$. The
semantics is defined as follows. Given a string $w = a_0 \cdots a_n$, we
define a function $r_w$ such that $r_w(i) = \delta_R^*(q_R^{init}, a_n \cdots a_{i+1})$.
In other words, $r_w(i)$ gives the state reached by R on the
reversed suffix starting at i+1. At each step i, the transducer T
reads the symbol $(r_w(i), a_i)$. The extended transition functions
now take as input a lookahead word, which is a sequence of
pairs of lookahead states and characters, i.e., from $(Q_R \times \Sigma)^*$.
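
The lookahead semantics can be sketched in Python by first computing all r-values in one right-to-left pass and then running T on the resulting pairs. The example below is hypothetical and much simpler than getTags: R remembers the next input character, and T keeps a character only if the next character is b.

```python
def lookahead_states(dR, initR, w):
    """r[i] = state reached by R on the reversed suffix w[i+1:]."""
    r = [None] * len(w)
    state = initR
    for i in range(len(w) - 1, -1, -1):
        r[i] = state                   # R has read w[n], ..., w[i+1] so far
        state = dR[(state, w[i])]      # extend the reversed suffix by w[i]
    return r

def run_with_lookahead(st, out, initT, dR, initR, w):
    """Run T on the pairs (r_w(i), a_i) and concatenate the outputs."""
    r = lookahead_states(dR, initR, w)
    q, pieces = initT, []
    for a, ra in zip(w, r):
        pieces.append(out[(q, (ra, a))])
        q = st[(q, (ra, a))]
    return "".join(pieces)

R_STATES = ("e", "A", "B")             # e: empty suffix; A/B: next char a/b
dR = {(s, c): c.upper() for s in R_STATES for c in "ab"}
st = {("q", (r, c)): "q" for r in R_STATES for c in "ab"}
out = {("q", (r, c)): (c if r == "B" else "")
       for r in R_STATES for c in "ab"}
```

On the input abba the r-values are B, B, A, e, so the transducer keeps the first two characters and drops the last two, producing ab.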

To synthesize transducers with lookahead, we introduce
uninterpreted functions $d_R$ for the transition function of R,
and $look_w$ for the r-values of w on R. We also introduce a
bound $k_R$ on the number of states in the lookahead automaton
R (our algorithm has to synthesize both T and R). The
modified constraints needed to encode input-output types and
input-output examples to use lookahead are described in the
extended version of the paper [11]. Part of the transducer with
lookahead we synthesize for the getTags problem is shown
in Figure 4. Notice that there are 2 transitions out of $q_1$ for
the same input but different lookahead state: the string <x is
output when the lookahead state is $r_1$.

Fig. 4: Regular lookahead for getTags. (a) Subset of transitions in T. (b) Lookahead automaton R.


Lookahead and aggregate cost: Lookahead can help represent
transducers, even deterministic ones, in a way that
has lower aggregate cost, i.e., the aggregate cost better
approximates the actual edit distance. Suppose that we want to
synthesize a transducer that translates the string abc to ab
and the string abd to bd. This translation can be done using
a deterministic transducer with transitions $q_0 \xrightarrow{a/\varepsilon} q_1 \xrightarrow{b/\varepsilon} q_2$,
followed by two transitions from $q_2$ that choose the correct
output based on the next character. Such a transducer would
have a high aggregate cost of 4, even though the actual edit
distance is 1. In contrast, using lookahead we can obtain a
transducer that can output each character when reading it; this
transducer will have aggregate cost 1 for either string. We
conjecture that for every transducer T, there always exists an
equivalent transducer with regular lookahead (T', R) for which
the edit distance computation for aggregate cost coincides with
the actual edit distance of T.


C. Transducer Repair 


In this section, we show how our synthesis technique can 
also be used to “repair” buggy transducers. The key idea is 
to use the closure properties of automata and transducers— 
e.g., closure under union and sequential compositions [8]— 
to reduce repair problems to synthesis ones. The ability 
to algebraically manipulate transducers and automata is one 
of the key aspects that distinguishes our work from other 
synthesis works that use domain-specific languages [1], [5]. 

We describe two settings in which we can repair an incorrect
transducer $T_{bad}$:

1) Let $\{P\}T_{bad}\{Q\}$ be an input-output type
violated by $T_{bad}$ and let $Out_P(T_{bad})$ be the finite automaton
describing the set of strings $T_{bad}$ can output when fed inputs in
P (this is computable thanks to closure properties of
transducers). We are interested in the case where
$Out_P(T_{bad}) \setminus Q \neq \emptyset$, i.e., $T_{bad}$ can produce strings that are
not in the output type.

2) Let $[s \mapsto t]$ be a set of input-output examples. We are
interested in the case where there is some example $s \mapsto t$ such
that $T_{bad}(s) \neq t$.


Repairing from the Input Language. This approach
synthesizes a new transducer for the inputs on which $T_{bad}$ is
incorrect. Using properties of transducers, we can compute
an automaton describing the exact set of inputs $P_{bad} \subseteq P$ for
which $T_{bad}$ does not produce an output in Q (see pre-image
computation in [10]). Let restrict(T, L) be the transducer
that behaves as T if the input is in L and does not produce
an output otherwise (closure under restriction [10]). If we
synthesize a transducer $T_1$ with type $\{P_{bad}\}T_1\{Q\}$, then the
transducer $restrict(T_1, P_{bad}) \cup restrict(T_{bad}, P \setminus P_{bad})$ satisfies
the desired input-output type (closure under union).


Fault Localization from Examples. We use this technique
when $T_{bad}$ is incorrect on an example. We can compute a
set of "suspicious" transitions by taking all the transitions
traversed when $T_{bad}(s) \neq t$ for some $s \mapsto t \in E$ (i.e., one of
these transitions is wrong) and removing all the transitions
traversed when $T_{bad}(s) = t$ for some $s \mapsto t \in E$ (i.e., transitions
that are likely correct). Essentially, this is a way of identifying
$P_{bad}$ when $T_{bad}$ is wrong on some examples. We can also use
this technique to limit the transitions we need to synthesize
when performing repair.
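
The fault localization step amounts to a set difference over traversed transitions. The Python sketch below illustrates it on a hypothetical one-state transducer with a single buggy transition; the function names and the table layout are assumptions.

```python
def traversed(st, init, s):
    """Set of transitions (state, char) used by the run on input s."""
    q, used = init, set()
    for c in s:
        used.add((q, c))
        q = st[(q, c)]
    return used

def suspicious(st, out, init, examples):
    """Transitions used by some failing example and by no passing one."""
    def transduce(s):
        q, pieces = init, []
        for c in s:
            pieces.append(out[(q, c)])
            q = st[(q, c)]
        return "".join(pieces)

    bad, good = set(), set()
    for s, t in examples:
        target = good if transduce(s) == t else bad
        target.update(traversed(st, init, s))
    return bad - good

# Hypothetical buggy identity transducer: it outputs a when reading b.
st = {("q", "a"): "q", ("q", "b"): "q"}
out = {("q", "a"): "a", ("q", "b"): "a"}
examples = [("aa", "aa"), ("ab", "ab")]
```

Here the example aa passes and ab fails, so `suspicious(st, out, "q", examples)` returns `{("q", "b")}`, the only transition that would need to be resynthesized.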


V. EVALUATION 


We implemented our technique in a Java tool ASTRA 
(Automatic Synthesis of TRAnsducers), which uses Z3 [17] to 
solve the generated constraints. We ran the experiments on a 2.7 GHz
Intel Core i5 with 8 GB of RAM and a 300s timeout.


Q1: Can ASTRA synthesize practical transformations? 


Benchmarks. Our first set of benchmarks is obtained from 
Optician [5], [6], a tool for synthesizing lenses, which are 
bidirectional programs used for keeping files in different data 
formats synchronized. We adapted 11 of these benchmarks 
to work with ASTRA (note that we only synthesize one- 
directional transformations), and added one additional bench- 
mark extrAcronym2, which is a harder variation (with a larger 
input type) of extrAcronym. We excluded benchmarks that 
require some memory, e.g., swapping words in a sentence, as 
they cannot be modeled with transducers. Our second set of 
benchmarks (Miscellaneous) consists of 6 problems we created
based on file transformation tasks (unixToDos, dosToUnix and
CSVSeparator), and s-FTs from the literature: escapeQuotes
from [18], and getTags and quicktimeMerger from [16]. All of the
benchmarks require synthesizing s-FTs and getTags requires 
synthesizing an s-FT with lookahead (details in Table I). 

To generate the examples, we started with the examples that 
were used in the original source when available. In 5 cases, 
ASTRA synthesized a transducer that was not equivalent to the 
one synthesized by Optician. In these cases, we used ASTRA to 
synthesize two different transducers that met the specification, 
computed a string on which the two transducers differed, and 
added the desired output for that string as an example. We 
repeated this task until ASTRA yielded the desired transducer 
and we report the time for such sets of examples. The ability 
to check equivalence of two transducers is yet another reason 
why synthesizing transducers is useful. For each benchmark 
we chose a mean edit distance of 0.5 when the transformation 
could be synthesized with this distance and of 1 otherwise. 


Effectiveness of ASTRA. ASTRA can solve 15/18 benchmarks
(13 in <1 s and 2 under a minute) and times out on 3
benchmarks where both P and Q are big.

While the synthesized transducers have at most 3 states, we
note that this is because ASTRA synthesizes total transducers
and then restricts their domains to the input type P. This is
advantageous because synthesizing small total transducers is
easier than synthesizing transducers that require more states to
define the domain. For instance, when we restrict the solution
of extrAcronym2 to its input type, the resulting transducer has
11 states instead of the 2 required by the original solution!

TABLE I: ASTRA’s performance on the synthesis benchmarks. The right-most set of columns gives the synthesis time for ASTRA and Optician
(under 2 different configurations). The middle set of columns gives the sizes of the parameters to the synthesis problem: $Q_P$ and $Q_Q$ denote
the number of input and output states, and $\delta_P$ and $\delta_Q$ denote the number of transitions in the input and output types, respectively. An x
represents a benchmark that failed. — stands in for data that is not available; this is because we only re-ran Optician on the benchmarks
that were already encoded in its benchmark set, plus a few additional ones for comparing between the tools that we wrote ourselves.


Comparison with Optician. We do not compare ASTRA to 
tools that only support input-output examples. Instead, we 
compare ASTRA to Optician on the set of benchmarks common 
to both tools. Like ASTRA, Optician supports input-output 
examples and types, but the types are expressed as regular 
expressions. Furthermore, Optician also attempts to produce 
a program that minimizes a fixed information theoretical 
distance between the input and output types [5]. 


Optician is faster when the number of variables in the 
constraint encoding increases, while ASTRA is faster on the 
normalizeSpaces benchmark. Optician, which uses regular 
expressions to express the input and output types, does not 
work so well with unstructured data. To confirm this trend, we 
wrote synthesis tasks for the escapeQuotes and getTags bench- 
marks in Optician and it was unable to synthesize those—e.g., 
escapeQuotes requires replacing every " character with \". To 
further look at the reliance of Optician on regular expressions, 
we converted the regular expressions used in the lens synthesis 
benchmarks to automata and then back to regular expressions 
using a variant of the state elimination algorithm that acts on 
character intervals. This results in regular expressions that are 
not very concise and might have redundancies. Optician could 
only solve 4/11 benchmarks that it was previously synthesizing 
(Optician-re in Table I). 

Answer to Q1: ASTRA can solve real-world benchmarks 
and has performance comparable to that of Optician for similar 
tasks. Unlike Optician, ASTRA does not suffer from variations 
in how the input and output types are specified. 


Benchmark QP Qo dp ôg X E k ç l d | ASTRA (s) Optician (s) Optician-re (s) 
extrAcronym 6 3 10 3 Sr T E S 0.11 0.05 x 
extrAcronym2 6 3 16 3 SP ga 222 al 1 0.42 — — 
extrNum 15 13 17 12 3 1 1 1 1 0.93 0.05 0.07 
extrQuant 4 3 8 5 2 1 2 1 1 0.19 0.09 x 

= | normalizeSpaces 7 6 19 10 2 2 2 ON OL 0.46 16.64 x 
‘S| extrOdds 15 9 29 13 Sy 3-32. ol 15.87 0.12 x 
= | capProb 3 3 3 3 2 2 2 T 1 0.05 0.05 x 
© | removeLast 6 s E 8p Be 535 DoT 5 0.21 0.15 0.07 
sourceTo Views 18 7 26 15 SuN Sie DE oe 50.92 0.06 x 
normalizeNamePos 19 7 35 24 13 1 6 2 1 x 0.05 0.10 
titleConverter 22 13 41 41 15 1 3 1 1 x 0.07 x 
bibtextToReadable 14 11 AL 35 12 21 5 1 ~«21 x 0.64 0.15 

2 unixToDos 5 7 17 19 4 4 2 2 5 1.24 — — 
8 | dosToUnix 7 5 19 17 4 4 2 1 5 0.41 — — 
S | CSVSeparator 5 5 9 9 4 1 1 1 1 0.142 — — 
3 | escapeQuotes 2 2 6 5 3) 5 52 2" oh 0.188 x x 
2 quicktimeMerger 7 3 9 3 De 5 E we BS 0.075 — — 
= | getTags 3 3 9 4 3 5 2 2 1 0.95 x x 


Q2: Can ASTRA repair transducers in practice? 


Benchmarks. We considered the benchmarks in Table II. 
The only pre-existing benchmark that we found was es- 
capeQuotes, through the interface of the Bek programming 
language used for verifying transducers [18]. We generated 
11 additional faulty transducers to repair in the following two 
ways: (i) Introducing faults in our synthesis benchmarks: We 
either replaced the output string of a transition with a constant 
character, inserted an extra character, or deleted a transition 
altogether. (ii) Incorrect transducers: We intentionally provided 
fewer input-output examples and used only example-based 
constraints on some of our synthesis benchmarks. 

All the benchmarks involve s-FTs. Three benchmarks are 
wrong on input-output types and examples, and the rest are 
only wrong on examples. Additionally, we note that to repair 
a transducer, we need the “right” set of minterms. Typically, 
the set of minterms extracted from the transducer predicates is 
the right one, but in the case of the escapeBrackets problems, 
ASTRA needs a set of custom minterms we provide manually. 
We are not aware of another tool that solves transducer repair 
problems and so do not show any comparisons. 


Effectiveness of ASTRA. We indicate the number of suspicious
transitions identified by our fault localization procedure
(Section IV-C) in the column labeled $\delta_{T_{bad}}$. In many cases,
ASTRA can detect 50% of the transitions or more as being
likely correct, therefore reducing the space of unknowns.

We compare 2 different ways of solving repair problems
in ASTRA. One uses the repair-from-input approach described
in Section IV-C (Default in Table II). The second approach
involves using a ‘template’, where we supply the constraint
solver with a partial solution to the synthesis problem, based
on the transitions that were localized as potentially buggy
(Template in Table II).

TABLE II: ASTRA’s performance on the repair benchmarks. Default is the case where a new transducer is synthesized for $P_{bad}$ and Template
is the case where a partial solution to the solver is provided. The $\delta_{T_{bad}}$ column gives the number of transitions that were localized by the
fault-localization procedure as a fraction of the total number of transitions in the transducer. The other columns that describe the parameters
of the synthesis problem in the default case are the same as for Table I.

ASTRA can solve 9/12 repair benchmarks (all in less than 
1 second). The times using either approach are comparable in 
most cases. While one might expect templates to be faster, this 
is not always the case because the input-output specification 
for the repair transducer is small, but providing a template 
requires actually providing a partial solution, which in some 
cases happens to involve many constraints. 

Answer to Q2: ASTRA can repair transducers with varying 
types of bugs. 


VI. RELATED WORK 


Synthesis of string transformations. String transformations 
are one of the main targets of program synthesis. Gulwani 
showed they could be synthesized from input-output examples 
[1] and introduced the idea of using a DSL to aid synthe- 
sis. Optician extended the DSL-based idea to synthesizing 
lenses [5], [6], which are programs that transform between 
two formats. Optician supports not only examples but also 
input-output types. While DSL-based approaches provide good 
performance, they are also monolithic as they rely on the 
structure of the DSL to search efficiently. ASTRA does not 
rely on a DSL and can synthesize string transformations 
from complex specifications that cannot be handled by DSL- 
based tools. Moreover, transducers allow applying verification 
techniques to the synthesized programs (e.g., checking whether 
two solutions are equivalent). One limitation of transducers 
is that they do not have ‘memory’, and consequently ASTRA 
cannot be used for data-transformation tasks where this is 
required—e.g., mapping the string Firstname Lastname 
to Lastname, Firstname—something Optician can do. 
We remark that there exist transducer models with such 
capabilities [19] and our work lays the foundations to handle 
complex models in the future. 


Synthesis of transducers. Benedikt et al. studied the “bounded
repair problem”, where the goal is to determine whether there
exists a transducer that maps strings from an input to an
output type using a bounded number of edits [13]. Their


Benchmark Qr Qo dep bq 4% E k l d Tta | Default (s) Template (s) 
swapCase1 2 1 6 3 3 2 Loh | 3/3 0.04 0.02 
3 | swapCase2 2 1 4 3 3 2 1 1 1 1/2 x x 
3 | swapCase3 2 1 6 3 3 2 1 1 1 1/3 0.06 0.05 
‘= | escapeBrackets1 2 6 16 36 8 4 1 4 4 1/3 0.69 0.42 
= | escapeBrackets2 1 6 1 7 6 5 1 4 4 1/2 x x 
(& | escapeBrackets3 2 7 8 36 9 5 1 4 4 2/3 1.12 0.34 
caesarCipher 2 1 4 2 3 1 1 1 1 1/1 X X 
extrAcronym2 11 3 30 3 3 3 2 1 1 12/30 0.59 10.15 
£ capProb 3 3 3 I A 2 Qe E I 3/3 0.04 0.04 
2 extrQuant 8 3 16 5-2 1 2 1 1 5/10 0.37 0.51 
removeLast 6 3 8 Se 3 2 2) = 5 718 0.40 1.08 
escapeQuotes 3 2 9 5 3 5 2 1 1 3/5 0.17 0.10 


work was the first to identify the relation between solving 
such a problem and solving games, an idea we leverage in 
this paper. However, their work is not implemented, cannot 
handle input-output examples, and therefore shies away from 
the source of NP-Completeness. Hamza et al. studied the 
problem of synthesizing minimal non-deterministic Mealy ma- 
chines (transducers where every transition outputs exactly one 
character), from examples [12]. They prove that the problem 
of synthesizing such transducers is NP-complete and provide 
an algorithm for computing minimal Mealy machines that 
are consistent with the input-output examples. ASTRA is a 
more general framework that incorporates new specification 
mechanisms, e.g., input-output types and distances, and uses 
them all together. Mealy machines are also synthesized from 
temporal specifications in reactive synthesis and regular model 
checking, where they are used to represent parameterized 
systems [20], [21]. This setting is orthogonal to ours as the 
specification is different and the transducer is again only a 
Mealy machine. 


The constraint encoding used in ASTRA is inspired by the 
encoding presented by Daniel Neider for computing minimal 
separating DFA, i.e. a DFA that separates two disjoint regular 
languages [22]. ASTRA’s use of weights and energy to specify 
a mean edit distance is based on energy games [23], a kind of 
2-player infinite game that captures the need for a player to 
not exceed some available resource. One way of solving such 
games is by defining a progress measure [15]. To determine 
whether a game has a winning strategy for one of the players, it 
can be checked whether such a progress measure exists in the 
game. We showed that the search for such a progress measure
can be encoded as an SMT problem.

Abstract—Attribute grammars allow the association of semantic
actions to the production rules in context-free grammars, pro- 
viding a simple yet effective formalism to define the semantics 
of a language. However, drafting the semantic actions can be 
tricky and a large drain on developer time. In this work, 
we propose a synthesis methodology to automatically infer the 
semantic actions from a set of examples associating strings 
to their meanings. We also propose a new coverage metric, 
derivation coverage. We use it to build a sampler to effectively 
and automatically draw strings to drive the synthesis engine. We 
build our ideas into our tool, PANINI, and empirically evaluate 
it on twelve benchmarks, including a forward differentiation 
engine, an interpreter over a subset of Java bytecode, and a 
mini-compiler for C language to two-address code. Our results 
show that PANINI scales well with the number of actions to be 
synthesized and the size of the context-free grammar, significantly 
outperforming simple baselines. 

Index Terms—Program synthesis, Attribute grammar, Seman- 
tic actions, Syntax directed definition 


I. INTRODUCTION 


Attribute grammars [1] provide an effective formalism to 
supplement a language syntax (in the form of a context-free 
grammar) with semantic information. The semantics of the 
language is described using semantic actions associated with 
the grammar productions. The semantic actions are defined in 
terms of semantic attributes associated with the non-terminal 
symbols in the grammar. 

Almost no modern applications use hand-written parsers
anymore; instead, most language interpretation engines today
use automatic parser generators (like YACC [2], BISON [3],
ANTLR [4], etc.). These parser generators employ the simple,
yet powerful formalism of attribute grammars to couple
parsing with semantic analysis to build an efficient frontend
for language understanding. This mechanism drives many
applications like model checkers (e.g., SPIN [5]), automatic
theorem provers (e.g., Q3B [6], CVC5 [7]), compilers (e.g.,
CIL [8]), database engines (e.g., MYSQL [9]), etc.

However, defining appropriate semantic actions is often not 
easy: they are tricky to express in terms of the inherited and 
synthesized attributes over the grammar symbols in the respec- 
tive productions. Drafting these actions for large grammars 
requires a significant investment of developer time. 

In this work, we propose an algorithm to automatically synthesize
semantic actions from sketches of attribute grammars.

    S ↣ E            [1]   output(E.val);
    E ↣ E + F        [2]   E.val ← h1•(E.val, F.val);
      | E - F        [3]   E.val ← h2•(E.val, F.val);
      | F            [4]   E.val ← F.val;
    F ↣ F * K        [5]   F.val ← h3•(F.val, K.val);
      | K            [6]   F.val ← K.val;
    K ↣ K ^ num      [7]   K.val ← h4•(K.val, getVal(num));
      | SIN (K)      [8]   K.val ← h5•(K.val);
      | COS (K)      [9]   K.val ← h6•(K.val);
      | num          [10]  K.val ← getVal(num) + 0ε;
      | var          [11]  K.val ← lookUp(Ω, var) + 1ε;

Fig. 1: Attribute grammar for automatic forward differentiation
(Ω is the symbol table)


Fig. 1 shows a sketch of an attribute grammar for automatic
forward differentiation using dual numbers (we explain the
notion of dual numbers and the example in detail in §III-A).
The production rules are shown in green while the
semantic actions are shown in blue. Our synthesizer
attempts to infer the definitions of the holes in this sketch (the
function calls $h_1^\bullet, h_2^\bullet, h_3^\bullet, h_4^\bullet, h_5^\bullet, h_6^\bullet$); we show these holes
with a highlighted background. As an attribute grammar attempts to
assign “meanings” to language strings, the meaning of a string
in this language is captured by the output construct.

This is a novel synthesis task: the current program synthesis 
tools synthesize a program such that a desired specification is 
met. In our present problem, we attempt to synthesize semantic 
actions within an attribute grammar: the synthesizer is required 
to infer definitions of the holes such that for all strings in the 
language described by the grammar, the computed semantic 
value (captured by the output construct) matches the intended 
semantics of the respective string—this is a new problem that 
cannot be trivially mapped to a program synthesis task. 

Our core observation to solve this problem is as follows:
for any string in the language, the sequence of semantic
actions executed for the syntax-directed evaluation of that
string is a loop-free program. This observation allows us to
reduce attribute grammar synthesis to a set of program synthesis
tasks. Unlike a regular program synthesis task where we
are interested in synthesizing a single program, the above
reduction requires us to solve a set of dependent program
synthesis instances simultaneously. These program synthesis
tasks are dependent as they contain common components (to
be synthesized) shared by multiple programs, and hence they
cannot be solved in isolation, as the synthesis solution from
one instance can influence others.

Fig. 2: Parse tree for input x^2+4*x+5

Given a set of examples $E$ (strings that can be derived
in the provided grammar) and their expected semantics $O$
(semantic values in output), for each example $e_i \in E$ we
apply the reduction by sequencing the set of semantic actions
for the productions that occur in the derivation of $e_i$. This
sequence of actions forms a loop-free program $P_i$, with the
expected semantic output $o_i \in O$ as the specification. We
collect such program-specification pairs, $(P_i, o_i)$, to create a
set of dependent synthesis tasks. An attempt at simultaneous
synthesis of all these tasks by simply conjoining the
synthesis constraints does not scale.

Our algorithm adopts an incremental, counterexample-guided
inductive synthesis (CEGIS) strategy that attempts to
handle only a “small” set of programs simultaneously: those
that violate the current set of examples. Starting with only
a single example, the set of satisfied examples is expanded
incrementally until the specifications are satisfied over all the
programs in the set.

Furthermore, to relieve the developer from providing ex- 
amples, we also propose an example generation strategy for 
attribute-grammars based on a new coverage metric. Our 
coverage metric, derivation coverage, attempts to capture 
distinct behaviors due to the presence or absence of each 
of the semantic actions corresponding to the syntax-directed 
evaluation of different strings. 

We build an implementation, PANINI¹, that is capable
of automatically synthesizing semantic actions (across both
synthesized and inherited attributes) in attribute grammars. For
the attribute grammar sketch in Fig. 1, PANINI automatically
synthesizes the definitions of holes as shown in Fig. 3 in a
mere 39.2 seconds. We evaluate our algorithm on a set of
attribute grammars, including a Java bytecode interpreter and
a mini-compiler frontend. Our synthesizer takes a few seconds
on these examples.

¹ Panini was a Sanskrit grammarian and scholar in ancient India.

To the best of our knowledge, ours is the first work on
automatic synthesis of semantic actions for attribute grammars.
The following are our contributions in this work:

• We propose a new algorithm for synthesizing semantic
  actions in attribute grammars;
• We define a new coverage metric, derivation coverage,
  to generate effective examples for this synthesis task;
• We build our algorithms into an implementation, PANINI,
  to synthesize semantic actions for attribute grammars;
• We evaluate PANINI on a set of attribute grammars
  to demonstrate the efficacy of our algorithm. We also
  undertake a case study on the attribute grammar of the
  parser of the SPIN model checker to automatically infer
  the constant-folding optimization and abstract syntax tree
  construction.

An extended version of this article is available [10]. The
implementation and benchmarks of PANINI are available at
https://github.com/pkalita595/Panini.


II. PRELIMINARIES 


Attribute grammars [1] provide a formal mechanism to
capture language semantics by extending a context-free grammar
with attributes. An attribute grammar $G$ is specified by
$(S, P, T, N, F, \Gamma)$, where

• $T$ and $N$ are the sets of terminal and non-terminal symbols
  (resp.), and $S \in N$ is the start symbol;
• $P$ is a set of (context-free) productions $p_i \in P$, where
  $p_i : X_i \rightarrowtail Y_{i1} Y_{i2} \ldots Y_{in}$; a production consists of a head
  $X_i \in N$ and a body $Y_{i1} \ldots Y_{in}$, such that each $Y_{ik} \in T \cup N$;
• $F$ is a set of semantic actions $f_i \in F$;
• $\Gamma : P \to F$ is a map from the set of productions $P$ to
  the set of semantic actions $f_i \in F$.

The set of productions in $G$ describes a language (denoted as
$\mathcal{L}(G)$) to capture the set of strings that can be derived from $S$.
A derivation is a sequence of applications of productions $p_i \in P$
that transforms $S$ to a string $w \in \mathcal{L}(G)$; unless specified, we
will refer to the leftmost derivation, where we always select the
leftmost non-terminal for expansion in a sentential form. As
we are only concerned with parseable grammars, we constrain
our discussion in this paper to unambiguous grammars.

The semantic actions associated with the grammar productions
are defined in terms of semantic attributes attached to
the non-terminal symbols in the grammar. Attributes can be
synthesized or inherited: while a synthesized attribute is
computed from the children of a node in a parse tree, an
inherited attribute is defined by the attributes of the parents
or siblings.

Fig. 1 shows an attribute grammar with context-free productions
and the associated semantic actions. Fig. 2 shows the
parse tree of the string $x^2 + 4x + 5$ on the provided grammar;
each internal node of the parse tree has associated semantic
actions (we have only shown the “unknown” actions that need
to be inferred).

    (a) h1•(a1 + a2ε, b1 + b2ε):
          r ← a1 + b1
          d ← a2 + b2
          return r + dε

    (b) h2•(a1 + a2ε, b1 + b2ε):
          r ← a1 - b1
          d ← a2 - b2
          return r + dε

    (c) h3•(a1 + a2ε, b1 + b2ε):
          r ← a1 * b1
          d ← a2 * b1 + a1 * b2
          return r + dε

    (d) h5•(a1 + a2ε):
          r ← sin(a1)
          d ← a2 * cos(a1)
          return r + dε

    (e) h6•(a1 + a2ε):
          r ← cos(a1)
          d ← a2 * sin(a1) * -1
          return r + dε

    (f) h4•(a1 + a2ε, c):
          r ← pow(a1, c)
          d ← c * a2 * pow(a1, c - 1)
          return r + dε

Fig. 3: Synthesized definitions for the holes in Fig. 1


Parser generators [2] accept an attribute grammar and 
automatically generate parsers that perform a syntax-directed 
evaluation of the semantic actions. For ease of discussion, we 
assume that the semantic actions are pure (i.e. do not cause 
side-effects like printing values or modifying global variables) 
and generate a deterministic output value as a consequence of 
applying the actions. 

An attribute grammar is non-circular if the dependencies 
between the attributes in every syntax tree are acyclic. Non- 
circularity is a sufficient condition that all strings have unique 
evaluations [11]. 


Notations. We notate production symbols by serif fonts, non-terminal
symbols (or placeholders) by capital letters (e.g., X)
and terminal symbols by small letters (e.g., a). Sets are denoted
in capital letters. We use arrows with tails ($\rightarrowtail$) in productions
and string derivations to distinguish them from function maps.
We use the notation $e[g_1/g_2]$ to imply that all instances of
the subexpression $g_2$ are to be substituted by $g_1$ within the
expression $e$. We use the notation of Hoare logic [12] to
capture program semantics: $\{P\}\,Q\,\{R\}$ implies that if the
program $Q$ is executed with a precondition $P$, it can only
produce an output state in $R$; $P$ and $R$ are expressed in some
base logic (like first-order logic).


III. OVERVIEW 


Sketch of an attribute grammar. We allow the sketch $G^\bullet$ of
an attribute grammar (as a syntax-directed definition (SDD)),
$G^\bullet = (S, P, T, N, H^\bullet, \Gamma)$, to contain holes $h_i^\bullet \in H^\bullet$ for unspecified
functionality within the semantic actions. For
example, in Fig. 1, the set of holes comprises the functions
$H^\bullet = \{h_1^\bullet, h_2^\bullet, h_3^\bullet, h_4^\bullet, h_5^\bullet, h_6^\bullet\}$. If the semantic action
corresponding to a production $p$ contains hole(s), we refer to the
production $p$ as a sketchy production; when the definitions for
all the holes in a sketchy production are resolved, we say that
the production is ready. The completion (denoted $G^{\{f_1, \ldots, f_n\}}$)
of a grammar sketch $G^\bullet$ denotes the attribute grammar where
a set of functions $f_1, \ldots, f_n$ replace the holes $h_1^\bullet, \ldots, h_n^\bullet$.
We denote the syntax-directed evaluation of a string $w$ on
an attribute grammar $G$ as $[\![w]\!]^G$; we consider that any such
evaluation results in a value (or $\bot$ if $w \notin \mathcal{L}(G)$).


Example Suite. An example (or test) for an attribute grammar
$G$ can be captured by a tuple $(w, v)$ such that $w \in \mathcal{L}(G)$ and
$[\![w]\!]^G = v$. A set of such examples constitutes an example
suite (or test suite).

If the language described by the grammar $G$ supports variables,
then any evaluation of $G$ needs a context, $\beta$, that binds
the free variables to input values. We denote such examples
as $[\![w]\!]^G_\beta = v$. When the grammar used is clear from the
context, we drop the superscript and simplify the notation to
$[\![w]\!]_\beta = v$. Consider the example $[\![\texttt{x\^{}3}]\!]_{x=2} = 8 + 12\varepsilon$,
where “x^3” is a string from the grammar shown in Fig. 1
and the input string evaluates to $8 + 12\varepsilon$ under the binding
$x \mapsto 2$. Clearly, if the language does not support variables,
the context $\beta$ is always empty.


Problem Statement. Given a sketch of an attribute
grammar, $G^\bullet$, an example set $E$ and a domain-specific
language (DSL) $D$, synthesize instantiations of the
holes by strings, $w \in D$, such that the resulting
attribute grammar agrees with all examples in $E$.

In other words, PANINI synthesizes functions $f_1, \ldots, f_n$
in the domain-specific language $D$ such that the completion
$G^{\{f_1, \ldots, f_n\}}$ satisfies all examples in $E$.


A. Motivating example: Automated Synthesis of a Forward 
Differentiation Engine 


We will use synthesis of an automatic forward differen- 
tiation engine using dual numbers [13] as our motivating 
example. We start with a short tutorial on how dual numbers 
are used for forward differentiation. 

1) Forward Differentiation using Dual numbers: Dual numbers,
written as $a + b\varepsilon$, capture both the value of a function
$f(x)$ (in the real part, $a$) and its derivative with
respect to the variable $x$, $f'(x)$ (in the dual part, $b$), within
the same number. Clearly, $a, b \in \mathbb{R}$ and we assume $\varepsilon^2 = 0$
(as it refers to the second-order differential, which we are not
interested in tracking). The reader may draw parallels to complex
numbers that are written as $a + ib$, where ‘$i$’ identifies the
imaginary part, and $i^2 = -1$.

Let us understand forward differentiation by calculating
$f'(x)$ at $x = 3$ for the function $f(x) = x^2 + 4x + 5$.

First, the term $x$ needs to be converted to a dual number at
$x = 3$. For $x = 3$, the real part is clearly 3. To find the dual
part, we differentiate the term with respect to the variable $x$, i.e.,
$\frac{dx}{dx}$, which evaluates to 1. Hence, the dual number representation
of the term $x$ at $x = 3$ is $3 + 1\varepsilon$.

Now, the dual value of the term $x^2$ can be computed simply by
taking the square of the dual representation of $x$:

$$(3 + 1\varepsilon)^2 = (3 + 1\varepsilon)(3 + 1\varepsilon) = 3^2 + (2 \cdot 3 \cdot \varepsilon) + \varepsilon^2 = 9 + 6\varepsilon + 0$$

Finally, the dual number representation for the constant 4 is
$4 + 0\varepsilon$ (as the derivative of a constant is 0). Similarly, the dual
value for $4x$ is $(4 + 0\varepsilon)(3 + 1\varepsilon) = 12 + 4\varepsilon + 0$. So, we can
compute the dual number for $f(x) = x^2 + 4x + 5$ as:

$$(9 + 6\varepsilon) + (12 + 4\varepsilon) + (5 + 0\varepsilon) = 26 + 10\varepsilon$$

Hence, the value of $f(x)$ at $x = 3$ is $f(3) = 26$ (real part
of the dual number above) and that of its derivative, $f'(x) =
2x + 4$, is $f'(3) = 10$, which is indeed given by the dual
part of the dual number above.
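
The calculation above can be replayed with a small dual-number class. This is a minimal sketch for illustration only, not part of PANINI; the class name `Dual` and its fields `re` and `du` are assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A dual number re + du*eps, with eps*eps = 0."""
    re: float
    du: float = 0.0

    def __add__(self, other: "Dual") -> "Dual":
        return Dual(self.re + other.re, self.du + other.du)

    def __mul__(self, other: "Dual") -> "Dual":
        # (a1 + a2 eps)(b1 + b2 eps) = a1 b1 + (a2 b1 + a1 b2) eps, since eps^2 = 0
        return Dual(self.re * other.re, self.du * other.re + self.re * other.du)


def f(x: Dual) -> Dual:
    """f(x) = x^2 + 4x + 5, evaluated over dual numbers."""
    return x * x + Dual(4) * x + Dual(5)


# Seed the variable with dual part 1 (dx/dx = 1) and evaluate at x = 3.
RESULT = f(Dual(3, 1))
```

The real part of `RESULT` is f(3) and the dual part is f'(3), matching the hand computation.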

2) Synthesizing a forward differentiation engine: The attribute
grammar in Fig. 1 (adapted from [14]) implements forward
differentiation for expressions in the associated context-free
grammar; we will use this attribute grammar to illustrate
our synthesis algorithm. lookUp(Ω, var) returns the value of
the variable var from the symbol table Ω.

We synthesize programs for the required functionalities for
the holes from the domain-specific language (DSL) shown in
Equation 1. Function pow(a, c) calculates a raised to the
power of c. We assume the availability of an input-output
oracle, Oracle(w(β)), that returns the expected semantic
value for string w under the context β.

    Fun ::= C + Cε
    C   ::= var | num | 1 | 0 | -C | C + C | C - C | C * C
          | sin(C) | cos(C) | pow(C, num)                        (1)

B. Synthesis of semantic actions

Our core insight towards solving this synthesis problem is that the
sequence of semantic actions corresponding to the syntax-directed
evaluation of any string on the attribute grammar constitutes a
loop-free program.

    h3•(a1 + a2ε, b1 + b2ε):
      r ← ...
      d ← b1 + b2 + 3 * a2
      return r + dε

Fig. 4: Wrong definition of h3•

Fig. 5 shows the loop-free program from the semantic
evaluation of the example
$[\![\texttt{x\^{}2+4*x+5}]\!]_{x=3} = 26 + 10\varepsilon$; the Hoare triple
captures the synthesis constraints over the holes.

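Similarly, our algorithm constructs constraints (as
Hoare triples) over the set of all examples $E$ (e.g.,
$[\![\texttt{x+x}]\!]_{x=13} = 26 + 2\varepsilon$,
$[\![\texttt{3-x}]\!]_{x=7} = -4 - 1\varepsilon$,
$[\![\texttt{x*x}]\!]_{x=4} = 16 + 8\varepsilon$,
$[\![\texttt{sin(x\^{}2)}]\!]_{x=3} = 0.41 - 5.47\varepsilon$,
$[\![\texttt{cos(x\^{}2)}]\!]_{x=2} = -0.65 + 3.02\varepsilon$,
$[\![\texttt{x*cos(x)}]\!]_{x=4} = -2.61 + 2.37\varepsilon$).
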

Synthesizing definitions for holes that satisfy Hoare triple 
constraints of all the above examples yields a valid completion 
of the sketch of the attribute grammar (see Fig. 3). As the 
above queries are “standard” program synthesis queries, they 
can be answered by off-the-shelf program synthesis tools 
[15], [16]. Hence, our algorithm reduces the problem of 
synthesizing semantic actions for attribute grammars to solving 
a conjunction of program synthesis problems. 
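
To make the reduction concrete, the loop-free program for x^2+4*x+5 at x = 3 can be written as a straight-line Python function whose holes are parameters, so that the Hoare triple becomes a check on candidate hole definitions. This is a sketch under assumptions made here: dual numbers are (real, dual) pairs, the intermediate names `k1`, `f1`, and so on are illustrative, and `h4_no_factor` is a hypothetical wrong candidate that does not come from the paper.

```python
def h1(a, b):
    """Addition of dual numbers, as in Fig. 3(a)."""
    return (a[0] + b[0], a[1] + b[1])


def h3(a, b):
    """Multiplication of dual numbers, as in Fig. 3(c)."""
    return (a[0] * b[0], a[1] * b[0] + a[0] * b[1])


def h4(a, c):
    """Power rule: the derivative of a^c is c * a^(c-1) * a'."""
    return (a[0] ** c, c * a[1] * a[0] ** (c - 1))


def h4_no_factor(a, c):
    """Hypothetical wrong candidate: forgets the factor c."""
    return (a[0] ** c, a[1] * a[0] ** (c - 1))


def trace(x, h1, h3, h4):
    """Straight-line program for x^2+4*x+5; the holes are passed in."""
    k1 = (x, 1)          # var: lookUp + 1 eps
    k2 = h4(k1, 2)       # x ^ 2
    k3 = (4, 0)          # num: 4 + 0 eps
    f1 = h3(k3, k1)      # 4 * x
    e1 = h1(k2, f1)      # x^2 + 4*x
    k4 = (5, 0)          # num: 5 + 0 eps
    return h1(e1, k4)    # ... + 5


def satisfies(pre_x, post, h1, h3, h4):
    """Check the triple {x = pre_x} trace {output = post} for given holes."""
    return trace(pre_x, h1, h3, h4) == post


OK = satisfies(3, (26, 10), h1, h3, h4)
REFUTED = not satisfies(3, (26, 10), h1, h3, h4_no_factor)
```

A synthesizer searches for hole definitions that make such checks pass for every example at once; the wrong power rule is rejected here because it yields a dual part of 7 instead of 10.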


While the above conjunction can be easily folded into a
single program synthesis query and offloaded to a
program synthesis tool, quite understandably,
it will not scale. To scale the above problem,
we employ a refutation-guided inductive synthesis
procedure: we sort the set of examples by
increasing complexity, completing the holes for
the easier instances first. The synthesized definitions are
frozen while handling new examples; however, unsatisfiability
of a synthesis call with frozen procedures refutes the prior
synthesized definitions. Say we need to synthesize definitions
for the holes $\{h_1^\bullet, \ldots, h_n^\bullet\}$, and examples $\{e_1, \ldots, e_{i-1}\}$ have already
been handled, with definitions $\{h_1^\bullet = f_1, \ldots, h_k^\bullet = f_k\}$
already synthesized. To handle a new example, $e_i$, we
issue a synthesis call for the procedures $\{h_1^\bullet, \ldots, h_n^\bullet\}$ with the
definitions $\{h_1^\bullet = f_1, \ldots, h_k^\bullet = f_k\}$ frozen. Say the constraint
corresponding to $e_i$ includes only some of these calls, a few of
them frozen, and the synthesis query is unsatisfiable. In this case, we unfreeze
only the participating frozen definitions and
make a new synthesis query. As the new query only attempts
to synthesize a few new calls (with many participating calls
frozen to previously synthesized definitions), this algorithm
scales well.

    {x = 3}
    K1.val ← 3 + 1ε;
    K2.val ← h4•(K1.val, 2);
    K3.val ← 4 + 0ε;
    F1.val ← h3•(K3.val, K1.val);
    E1.val ← h1•(K2.val, F1.val);
    K4.val ← 5 + 0ε;
    output ← h1•(E1.val, K4.val);
    {output = 26 + 10ε}

Fig. 5: Hoare triple constraint for $x^2 + 4x + 5$ at $x = 3$


For example, consider the grammar in Fig. 1: the loop-free
program resulting from the semantic evaluation of
the input $[\![\texttt{x+x}]\!]_{x=13} = 26 + 2\varepsilon$ (say trace $t_1$) includes
only one hole, $h_1^\bullet$. Hence, we synthesize $h_1^\bullet$ with only the
constraint $\{x = 13\}\ t_1\ \{output = 26 + 2\varepsilon\}$, which results
in the definition shown in Fig. 3a. Next, we consider
the input $[\![\texttt{x*x}]\!]_{x=4} = 16 + 8\varepsilon$; its constraint includes
the hole $h_3^\bullet$, which is synthesized as the function
definition shown in Fig. 4. Now, with $\{h_1^\bullet, h_3^\bullet\}$ frozen to
their respective synthesized definitions, we attempt to
handle $[\![\texttt{x*cos(x)}]\!]_{x=4} = -2.61 + 2.37\varepsilon$. Its constraint
includes the holes $\{h_3^\bullet, h_6^\bullet\}$; now we only attempt to synthesize
$h_6^\bullet$ while constraining $h_3^\bullet$ to use the definition in Fig. 4.
In this case the synthesizer fails to synthesize $h_6^\bullet$ since the
synthesized definition for $h_3^\bullet$ is incorrect, thereby refuting the
synthesized definition of $h_3^\bullet$. Hence, we now unfreeze $h_3^\bullet$ and
call the synthesis engine again to synthesize both $h_3^\bullet$ and $h_6^\bullet$
together. This time we succeed in inferring the correct
definitions, as shown in Fig. 3.


C. Example Generation 


We propose a new coverage metric, derivation coverage, to generate good samples to drive synthesis. Let us explain derivation coverage with an example, $[x^2+4*x+5]_{x=3}$, from the grammar in Fig. 1. The leftmost derivation of this string covers eight productions ({1, 2, 4, 5, 6, 7, 10, 11}) out of a total of 11 productions. Intuitively, this implies that the Hoare triple constraint from its semantic evaluation will test the semantic actions corresponding to these productions.

Similarly, the Hoare logic constraint from the example $[x^2+7*x+\sin(x)]_{x=2}$ will cover 9 of the productions, {1, 2, 4, 5, 6, 7, 8, 10, 11}. As it also covers the semantic action for production 8, it tests an additional behavior of the attribute grammar. On the other hand, the example $[x^3+5*x]_{x=5}$ invokes the productions {1, 2, 4, 5, 6, 7, 10, 11}. As all these semantic actions have already been covered by the example $[x^2+4*x+5]_{x=3}$, it does not exercise the semantic action of any new production.

In summary, derivation coverage attempts to abstract the 
derivation of a string as the set of productions in its leftmost 
derivation. It provides an effective metric for quantifying the 
quality of an example suite and also for building an effective 
example generation system. 
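The selection idea behind derivation coverage can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the production sets are the ones quoted in the text above, and the function name `greedy_suite` and its greedy, in-order policy are hypothetical, not PANINI's actual example generator.

```python
def greedy_suite(candidates):
    """Keep a candidate only if it covers at least one new production.

    `candidates` is a list of (string, set-of-production-numbers) pairs.
    Returns the selected strings and the set of covered productions.
    """
    covered, suite = set(), []
    for name, prods in candidates:
        if prods - covered:          # contributes an uncovered production
            suite.append(name)
            covered |= prods
    return suite, covered


# Production sets as quoted in the text (grammar of Fig. 1, 11 productions).
CANDIDATES = [
    ("x^2+4*x+5", {1, 2, 4, 5, 6, 7, 10, 11}),
    ("x^2+7*x+sin(x)", {1, 2, 4, 5, 6, 7, 8, 10, 11}),
    ("x^3+5*x", {1, 2, 4, 5, 6, 7, 10, 11}),
]

suite, covered = greedy_suite(CANDIDATES)
print(suite, len(covered), "of 11 productions covered")
```

The third string is dropped because its production set adds nothing new, which mirrors the argument in the text.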


Validation. Our example generation strategy can start off by sampling strings $w$ from the grammar (that improve derivation coverage) and a context $\beta$; next, it can query the oracle for the intended semantic value $v = Oracle(w(\beta))$ to create an example $[w]_\beta = v$.

Suppose that the algorithm automatically finds the example $[x^2+4*x+5]_{x=3}$. Now, there are two possible, semantically distinct definitions that satisfy the above constraint (see Fig. 3c and Fig. 4), indicating that the problem is underconstrained. Hence, our system needs to select additional examples to resolve this. One solution is to sample multiple contexts on the same string to create multiple constraints:

• $\{x = 2\}\ K_1.val \leftarrow 2 + 1\varepsilon; \ldots\ \{output = 13 + 6\varepsilon\}$

• $\{x = 4\}\ K_1.val \leftarrow 4 + 1\varepsilon; \ldots\ \{output = 29 + 10\varepsilon\}$

The above constraints resolve the ambiguity and allow the induction of semantically unique definitions. The check for semantic uniqueness can be framed as a check for distinguishing inputs: given a synthesized completion $G^{\{f_1,\ldots,f_n\}}$, we attempt to synthesize an alternate completion $G^{\{g_1,\ldots,g_n\}}$ and an example string $w$ (and context $\beta$) such that $[w]_\beta^{G^{\{f_1,\ldots,f_n\}}} \neq [w]_\beta^{G^{\{g_1,\ldots,g_n\}}}$. In other words, for the same string (and context), the attribute grammar returns different evaluations corresponding to the two completions.


For example, $[x^2+4*x+5]_{x=9}$ is a distinguishing input witnessing the ambiguity between the definitions shown in Fig. 3c and Fig. 4.

Algorithm 1: SYNTHHOLES(G•, T, R, D)
1 φ ← ⊤;
2 G•_R ← G•[R];
3 for (w, v) ∈ T do
4     t ← GENTRACE([w]_{G•_R});
5     φ ← φ ∧ (out(t) = v);
6 B ← SYNTHESIZE(φ, D);
7 return B;

On the other hand, the algorithm could have sampled other 
strings (instead of contexts) for additional constraints. PANINI 
prefers the latter; that is, it first generates a good example suite 
(in terms of derivation coverage) and only uses distinguishing 
input as a validation (post) pass. If such inputs are found, 
additional contexts are added to resolve the ambiguity. 
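A rough Python sketch of the distinguishing-input check follows. It simplifies the idea in two ways that are assumptions of this sketch: the alternate completion `mul_g` is given up front rather than synthesized, and candidate strings and contexts are enumerated from small hypothetical lists. Dual numbers $a + b\varepsilon$ are modelled as tuples `(a, b)`; `mul_g` is an invented wrong definition and is not the one in Fig. 4.

```python
def mul_f(p, q):
    """Product rule on dual numbers (a, b) ~ a + b*eps."""
    return (p[0] * q[0], p[0] * q[1] + p[1] * q[0])


def mul_g(p, q):
    """Hypothetical alternate completion; agrees with mul_f on x*x."""
    return (p[0] * q[0], 2 * p[0] * q[1])


# Each string is modelled by an evaluator taking the candidate definition
# of the multiplication hole and the dual number bound to x.
STRINGS = {
    "x*x": lambda mul, x: mul(x, x),
    "3*x": lambda mul, x: mul((3, 0), x),
}


def distinguishing_input(f, g, strings, contexts):
    """Return the first (string, context) on which f and g disagree."""
    for name, evaluate in strings.items():
        for c in contexts:
            x = (c, 1)               # x = c + 1*eps
            if evaluate(f, x) != evaluate(g, x):
                return name, c
    return None


print(distinguishing_input(mul_f, mul_g, STRINGS, [2, 4]))
```

On the string `x*x` the two definitions agree for every context, so extra contexts alone cannot separate them; a different string is needed, which is why a search over both strings and contexts is useful.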

We provide the detailed algorithm of example generation in 
the extended version [10]. 


IV. ALGORITHM 


Given an attribute grammar $G^\bullet$, a set of holes $h_i \in H$, a domain-specific language $D$, an example suite $E$ and a context $\beta$, PANINI attempts to find instantiations $g_i$ for $h_i$ such that

Find $\{g_1, \ldots, g_{|H|}\} \in D$ such that $\forall (s, \beta, v) \in E.\ [s]^{G}_{\beta} = v$    (2)

where the attribute grammar $G = G^\bullet[g_1/h_1, \ldots, g_{|H|}/h_{|H|}]$ and the variable binding $\beta$ maps variables in $s$ to values.


A. Basic Scheme: ALLATONCE 


Our core synthesis procedure (Algorithm 1), SYNTHHOLES($G^\bullet$, $E$, $R$, $D$), accepts a sketch $G^\bullet$, an example (or test) suite $E$, a set of ready functions $R$ and a DSL $D$; all holes whose definitions are available are referred to as ready functions. When SYNTHHOLES is used as a top-level procedure (as in the current case), $R = \emptyset$; if $R$ is not empty, the definitions of the ready functions are substituted in the sketch $G^\bullet$ to create a new sketch $G^\bullet_R$ on the remaining holes (Line 2). We refer to the algorithm where $R = \emptyset$ at initialization as the ALLATONCE algorithm.

Our algorithm exploits the fact that a syntax-directed semantic evaluation of a string $w$ on an attribute grammar $G$ produces a loop-free program. It attempts to compute a symbolic encoding of this program trace in the formula $\varphi$ (initialized to true in Line 1). GENTRACE() instruments the semantic evaluation on the string $w$ to collect a symbolic trace (the loop-free program) consisting of the set of instructions encountered during the syntax-directed execution of the attribute grammar (Line 4); an output from an operation that is currently a hole is appended as a symbolic variable. The assertion that the expected output $v$ matches the final symbolic output $out(t)$ from the trace $t$ is appended to the list of constraints (Line 5). Finally, we use a program synthesis procedure, SYNTHESIZE, with the constraints $\varphi$ in an attempt to synthesize suitable function definitions for the holes in $\varphi$ (Line 6). Given a constraint in terms of a vector of inputs $\vec{x}$ and function symbols (corresponding to holes) $\vec{h}$,

SYNTHESIZE($\varphi(\vec{x}, \vec{h})$) := $\vec{h}$ such that $\exists \vec{h}.\ \forall \vec{x}.\ \varphi(\vec{x}, \vec{h})$    (3)


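We will use an example from forward differentiation (Fig. 1) to illustrate this. Let us consider two inputs, $[3*x+6]_{x=2} = 12 + 3\varepsilon$ and $[x^2-2*x]_{x=3} = 3 + 4\varepsilon$. For the first input, $[3*x+6]_{x=2} = 12 + 3\varepsilon$, the procedure GENTRACE() (Line 4) generates a symbolic trace (denoted $t_1$):

$\{x_1 = 2 + 1\varepsilon;\ \alpha_1 = h_3^\bullet(3 + 0\varepsilon, x_1);\ out_1 = h_1^\bullet(\alpha_1, 6 + 0\varepsilon);\}$

The following symbolic constraint is generated from the above trace:

$\varphi_{t_1} \equiv (x_1 = 2 + 1\varepsilon \wedge \alpha_1 = h_3^\bullet(3 + 0\varepsilon, x_1) \wedge out_1 = h_1^\bullet(\alpha_1, 6 + 0\varepsilon))$

In the trace $t_1$, the operations $h_3^\bullet$ and $h_1^\bullet$ are holes, and the variables $\alpha_1$ and $out_1$ are fresh symbolic variables. In the next step (Line 5), the constraint generated from trace $t_1$ is added:

$\varphi \equiv \top \wedge (\varphi_{t_1} \wedge out_1 = 12 + 3\varepsilon)$

In the next iteration of the loop at Line 3, the algorithm takes the second input (i.e., $[x^2-2*x]_{x=3} = 3 + 4\varepsilon$). In this case, GENTRACE() generates the following trace ($t_2$):

$\{x_2 = 3 + 1\varepsilon;\ \alpha_3 = h_4^\bullet(x_2, 2);\ \alpha_4 = h_3^\bullet(2 + 0\varepsilon, x_2);\ out_2 = h_2^\bullet(\alpha_3, \alpha_4)\}$

The constraint generated from $t_2$ is

$\varphi_{t_2} \equiv (x_2 = 3 + 1\varepsilon \wedge \alpha_3 = h_4^\bullet(x_2, 2) \wedge \alpha_4 = h_3^\bullet(2 + 0\varepsilon, x_2) \wedge out_2 = h_2^\bullet(\alpha_3, \alpha_4))$

At Line 5, the new constraint is

$\varphi \equiv \top \wedge (\varphi_{t_1} \wedge out_1 = 12 + 3\varepsilon) \wedge (\varphi_{t_2} \wedge out_2 = 3 + 4\varepsilon)$

At Line 6, with $\varphi$ as the constraint, the algorithm attempts to synthesize definitions for the holes (i.e., $h_1^\bullet$, $h_2^\bullet$, $h_3^\bullet$ and $h_4^\bullet$).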
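The all-at-once scheme on the two inputs above can be mimicked with a small brute-force search in Python. This is a sketch under loud assumptions: the enumerative search over a four-function `DSL` stands in for the SKETCH-based synthesizer, the hole names (`h_add`, `h_sub`, `h_mul`, `h_pow`) are descriptive labels chosen for this illustration, and dual numbers $a + b\varepsilon$ are modelled as tuples `(a, b)`.

```python
from itertools import product


# Dual-number operations; (a, b) stands for a + b*eps.
def add(p, q): return (p[0] + q[0], p[1] + q[1])
def sub(p, q): return (p[0] - q[0], p[1] - q[1])
def mul(p, q): return (p[0] * q[0], p[0] * q[1] + p[1] * q[0])
def power(p, q):
    n = q[0]                                   # constant exponent
    return (p[0] ** n, n * p[0] ** (n - 1) * p[1])


DSL = {"add": add, "sub": sub, "mul": mul, "pow": power}


def trace1(h):                                 # [3*x+6] at x = 2
    x1 = (2, 1)
    a1 = h["h_mul"]((3, 0), x1)
    return h["h_add"](a1, (6, 0))


def trace2(h):                                 # [x^2-2*x] at x = 3
    x2 = (3, 1)
    a3 = h["h_pow"](x2, (2, 0))
    a4 = h["h_mul"]((2, 0), x2)
    return h["h_sub"](a3, a4)


EXAMPLES = [(trace1, (12, 3)), (trace2, (3, 4))]
HOLES = ["h_add", "h_sub", "h_mul", "h_pow"]


def synth_all_at_once(holes, examples):
    """Try every assignment of DSL functions to holes against all examples."""
    for names in product(DSL, repeat=len(holes)):
        h = {hole: DSL[name] for hole, name in zip(holes, names)}
        if all(trace(h) == expected for trace, expected in examples):
            return dict(zip(holes, names))
    return None


print(synth_all_at_once(HOLES, EXAMPLES))
```

The search space grows as the number of candidate functions raised to the number of holes, which gives an intuition for why solving all holes in a single query does not scale.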


B. Incremental Synthesis 


The ALLATONCE algorithm exhibits poor scalability with 
respect to the size of the grammar and the number of examples. 
The route to a scalable algorithm could be to incrementally 
learn the definitions corresponding to the holes and make use 
of the functions synthesized in the previous steps to discover 
new ones in the subsequent steps. 

However, driving synthesis one example at a time will lead 
to overfitting. We handle this complexity with a two-pronged 
strategy: (1) we partition the set of examples by the holes 
for which they need to synthesize actions, (2) we solve the 
synthesis problems by their difficulty (in terms of the number 
of functions to be synthesized) that allows us to memoize their 
results for the more challenging examples. We refer to this
scheme as the INCREMENTALSYNTHESIS algorithm.


Algorithm 2: SYNTHATTRGRAMMAR(G•, E, D)
1 T ← ∅;
2 R ← ∅;
3 while T ≠ E do
4     (w, v) ← SELECTEXAMPLE(E \ T);
5     Z ← GETSKETCHYPRODS(G•, w);
6     if Z ⊆ R then
7         G ← G•[R];
8         if [w]_G = v then
9             T ← T ∪ {(w, v)};
10            continue;
11        else
12            R ← R \ Z;
13            T_e ← T ∪ {(w, v)};
14    else
15        T_e ← {(w_i, v_i) | w ≅_G w_i, (w_i, v_i) ∈ E};
16    B ← SYNTHHOLES(G•, T_e, R, D);
17    if B = ∅ then
18        if R ∩ Z ≠ ∅ then
19            R ← R \ Z;
20            B ← SYNTHHOLES(G•, T ∪ T_e, R, D);
21        if B = ∅ then
22            return ∅;
23    R_B ← {⟨p_i : {..., h_i ↦ B[h_i], ...}⟩ | p_i ∈ Z \ R, h_i ∈ holes(T(p_i))};
24    R ← R ∪ R_B;
25    T ← T ∪ T_e;
26 return R;


Derivation Congruence. We define an equivalence relation, derivation congruence, on the set of strings $w \in L(G)$: strings $w_1, w_2 \in L(G)$ are said to be derivation congruent, $w_1 \cong_G w_2$, w.r.t. the grammar $G$, if and only if both strings $w_1$ and $w_2$ contain the same set of productions in their respective derivations. For example, consider $w_1 = [3*x+5]_{x=2}$, $w_2 = [5*x+12]_{x=3}$ and $w_3 = [4*x+7*x]_{x=\ldots}$. Note that although $w_1$ and $w_2$ have similar parse trees while $w_3$ has a quite different one, all three strings are derivation congruent to each other: even with different parse trees, they involve the same set of productions ({1, 2, 4, 5, 6, 10, 11}) in their leftmost derivations.
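Derivation congruence is easy to compute once the productions used by a string are known. The sketch below assumes a small hypothetical grammar (it is not the grammar of Fig. 1, and its production numbers are invented for the illustration): 1: S → E, 2: E → E + T, 3: E → T, 4: T → T * F, 5: T → F, 6: F → num, 7: F → x. The function `productions_used` is a hand-written parser for well-formed inputs of this toy grammar only.

```python
import re


def productions_used(s):
    """Set of production numbers in the leftmost derivation of s."""
    tokens = re.findall(r"\d+|[x+*]", s)
    pos = 0
    used = {1}                           # 1: S -> E

    def factor():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        used.add(7 if tok == "x" else 6)  # 7: F -> x, 6: F -> num

    def term():
        nonlocal pos
        used.add(5)                      # 5: T -> F (leftmost factor)
        factor()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            used.add(4)                  # 4: T -> T * F
            factor()

    def expr():
        nonlocal pos
        used.add(3)                      # 3: E -> T (leftmost term)
        term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            used.add(2)                  # 2: E -> E + T
            term()

    expr()
    return frozenset(used)


def congruent(w1, w2):
    """Two strings are derivation congruent iff their production sets match."""
    return productions_used(w1) == productions_used(w2)


print(congruent("3*x+5", "5*x+12"), congruent("3*x+5", "4*x+7*x"),
      congruent("3*x+5", "x+x"))
```

Grouping examples by the returned frozenset yields the congruence classes, so examples that exercise the same semantic actions can be handed to the synthesizer together.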

Algorithm 2 shows our incremental synthesis strategy. Our algorithm maintains a set of examples (or tests) $T$ (line 1) that are consistent with the current set of synthesized functions for the holes; the currently synthesized functions (referred to as ready functions), along with the respective ready productions, are recorded in $R$ (line 2). The algorithm starts off by selecting the easiest example $(w, v)$ at line 4 such that the cardinality of the set of sketchy productions, $Z$, in the derivation of $w$ is minimal among all examples not in $T$. The set $R$ maintains a map from the set of sketchy productions to a set of assignments to functions synthesized (instantiations) for each hole contained in the respective semantic actions.

If all sketchy productions, $Z$, in the derivation of $w$ are now ready (line 6), we simply test (line 8) whether a syntax-guided evaluation with the currently synthesized functions in $R$ yields the expected value $v$: if the test passes, we add the new example to the set of passing examples in $T$ (line 9). Otherwise, as the current hole instantiations in $R$ are not consistent for the derivation of $w$, at line 12 we remove all the synthesized functions participating in the syntax-directed evaluation of $w$ (which is exactly $Z$). Furthermore, removal of some functions from $R$ requires us to re-assert the new functions on all the past examples (contained in $T$) in addition to the present example (line 13).

If not all the sketchy productions ($Z$) in the derivation of $w$ are ready, we attempt to synthesize functions for the missing holes, with the set of current definitions in $R$ provided in the synthesis constraint.

The synthesis procedure (line 16), if successful, yields a set of function instantiations for the holes. In this case, the solution set from $B$ is accumulated in $R$, and the set of passing tests is extended to contain the new examples in $T_e$.

However, synthesis may fail because some of the current definitions in $R$ that were assumed to be correct and included in the synthesis constraint are not consistent with the new examples ($T_e$). In this case, we remove the instantiations of all such holes occurring in the syntax-directed evaluation of $w$ (line 19) and re-attempt synthesis (line 20). If this attempt fails too, it implies that no instantiation of the holes exists in the provided domain-specific language $D$ (line 22).
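The freeze, refute and unfreeze cycle described above can be sketched in Python as follows. This is a simplified illustration, not the PANINI implementation: a brute-force search over a tiny `DSL` replaces SKETCH, each example is handled alone instead of with its whole derivation-congruence class, holes are tracked directly rather than through productions, and `mul_naive` is an invented wrong candidate placed first so that a refutation actually occurs. Dual numbers $a + b\varepsilon$ are tuples `(a, b)`.

```python
from itertools import product


def add(p, q): return (p[0] + q[0], p[1] + q[1])
def mul(p, q): return (p[0] * q[0], p[0] * q[1] + p[1] * q[0])
def mul_naive(p, q): return (p[0] * q[0], 2 * p[0] * q[1])  # hypothetical, wrong


# Candidate order matters: the wrong multiplication is tried first.
DSL = {"mul_naive": mul_naive, "add": add, "mul": mul}

# (holes used, evaluator over a hole->function map, expected dual number)
EXAMPLES = [
    ({"h_mul"}, lambda h: h["h_mul"]((4, 1), (4, 1)), (16, 8)),      # x*x, x=4
    ({"h_add"}, lambda h: h["h_add"]((13, 1), (13, 1)), (26, 2)),    # x+x, x=13
    ({"h_add", "h_mul"},
     lambda h: h["h_add"](h["h_mul"]((3, 0), (2, 1)), (6, 0)), (12, 3)),  # 3*x+6
]


def synth_holes(holes, tests, ready):
    """Find DSL functions for `holes` that satisfy `tests`, keeping `ready` frozen."""
    holes = sorted(holes)
    for names in product(DSL, repeat=len(holes)):
        assign = {**ready, **dict(zip(holes, names))}
        funcs = {h: DSL[n] for h, n in assign.items()}
        if all(run(funcs) == v for _, run, v in tests):
            return dict(zip(holes, names))
    return None


def incremental_synthesis(examples):
    ready, passed, refutations = {}, [], 0
    for ex in sorted(examples, key=lambda e: len(e[0])):   # easiest first
        holes = ex[0]
        found = synth_holes(holes - set(ready), [ex], ready)
        if found is None:
            if not (holes & set(ready)):
                return None                   # nothing frozen to blame
            refutations += 1
            for h in holes:                   # unfreeze participating holes
                ready.pop(h, None)
            found = synth_holes(holes, passed + [ex], ready)
            if found is None:
                return None                   # no solution in the DSL
        ready.update(found)
        passed.append(ex)
    return ready, refutations


print(incremental_synthesis(EXAMPLES))
```

In this run the first example is satisfied by the wrong `mul_naive`, the third example refutes it, and the retry over all passed examples recovers the product rule with one refutation recorded.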

We provide a detailed example on the run of this algorithm 
in the extended version [10]. 


Theorem. If the algorithm terminates with a non-empty set of functions, $G^\bullet$ instantiated with the synthesized functions will satisfy the examples in $E$; that is, the synthesized functions satisfy Equation 2.

The proof is a straightforward argument with the inductive invariant that, at each iteration of the loop, $G^\bullet$ instantiated with the functions in $R$ satisfies the examples in $T$.


V. EXPERIMENTS 


Our experiments were conducted on an Intel(R) Xeon(R) CPU
E5-2620 @ 2.00GHz with 32 GB RAM and 24 cores, on the set
of benchmarks shown in Table I. PANINI uses FLEX [17] and
BISON [3] for performing a syntax-directed semantic evalu- 
ation over the language strings. PANINI uses SKETCH [15] 
to synthesize function definitions over loop-free programs, 
and the symbolic execution engine CREST [18] for generating 
example-suites guided by derivation coverage. 

We attempt to answer the following research questions:

• Can PANINI synthesize attribute grammars from a variety of sketches?

• How do the INCREMENTALSYNTHESIS and ALLATONCE algorithms compare?

• How does PANINI scale with the number of holes?

• How does PANINI scale with the size of the grammar?

The default algorithm for PANINI is the INCREMENTALSYNTHESIS algorithm; unless otherwise mentioned, PANINI refers to the implementation of INCREMENTALSYNTHESIS (Algorithm 2) for synthesis, using example generation guided by derivation coverage (a detailed explanation is available in the extended version [10]). While ALLATONCE works well for small grammars with few examples, INCREMENTALSYNTHESIS scales well even for larger grammars, both with the number of holes and the size of the grammar.

PANINI can synthesize semantic-actions across both syn- 
thesized and inherited attributes. Some of our benchmarks 
contain inherited-attributes: for example, benchmark b8 uses 
inherited-attributes to pass the type information of the vari- 
ables. Inherited-attributes pose no additional challenge; they 
are handled by the standard trick of introducing “marker” non- 
terminals [19]. 


A. Attribute Grammar Synthesis 


We evaluated PANINI on a set of attribute grammars adapted from software in open-source repositories [14], [19]-[24]. Table I shows the benchmarks, number of productions (#P), number of holes (#H), input example, solving time (Time, AAO for ALLATONCE and IS for INCREMENTALSYNTHESIS) and number of times a defined function was refuted (#R). Please recall that ALLATONCE refers to Algorithm 1 (§IV-A) and INCREMENTALSYNTHESIS refers to Algorithm 2 (§IV-B).

We provide more detailed descriptions of the benchmarks b1 to b9 in the extended version [10]. The benchmark b10 is the forward differentiation example described in §III-A. Benchmarks b11 and b12 are quite complex benchmarks that interpret a subset of Java bytecode and compile C code:


b11 Bytecode interpreter. Interpreter for a subset of Java bytecode; it supports around 36 instructions [25] of different types, i.e., load-store, arithmetic, logic and control transfer instructions.

b12 Mini-compiler. Fig. 6 shows the different steps of synthesizing semantic actions in the mini-compiler. Fig. 6a shows a snippet of the attribute grammar for the mini-compiler. Fig. 6b is a sample input for the mini-compiler. Fig. 6c shows the two-address code generated from the input code shown in Fig. 6b, which contains two holes. Finally, Fig. 6d shows the synthesized definitions for these two holes in the target language for two-address code.

Fig. 8 attempts to capture the fraction of time taken by the different phases of PANINI: example generation and synthesis. Not surprisingly, the synthesis phase dominates the cost as it requires several invocations of the synthesis engine, whereas the example generation phase does not invoke synthesis engines or SMT solvers. Further, the difference in time spent in these two phases increases as the benchmarks get more challenging.


B. ALLATONCE vs. INCREMENTALSYNTHESIS


1) Scaling with holes: Fig. 9a and Fig. 9b show how PANINI scales as the sketches contain increasingly more holes. We do this study for forward differentiation (b10) and the bytecode interpreter (b11). As can be seen, PANINI scales very well. On the other hand, ALLATONCE works well for small instances but soon blows up, timing out on all further instances. The interesting jump in b10 (at #Holes=8) was seen when we started adding holes for the definitions of the more complex operators like sin() and cos().

Fig. 6: Synthesis of mini-compiler (b12). (a) Attribute grammar sketch for mini-compiler. (b) A simple C code. (c) Three-address code generated from C code in Fig. 6b. (d) Synthesized definitions for the two holes:
    emit("load r1 a")   emit("load r2 b")   emit("plus r1 r2")
    emit("load r1 a")   emit("load r2 b")   emit("sub r1 r2")

TABLE I: Description of benchmarks

Id  | Benchmark                | #P | #H | Example                        | #R | Time AAO (s) | Time IS (s)
b1  | Count ones               | 5  | 1  | 11001                          | 0  | 3.2          | 3.1
b2  | Binary to integer        | 5  | 1  | 01110                          | 0  | 3.6          | 2.9
b3  | Prefix evaluator         | 7  | 4  | + 3 4                          | 0  | TO           | 10.1
b4  | Postfix evaluator        | 7  | 4  | 2 3 4 * +                      | 0  | TO           | 10.5
b5  | Arithmetic calculator    | 8  | 4  | 5*24+8                         | 0  | TO           | 12.8
b6  | Currency calculator      | 10 | 4  | USD 3 + INR 8                  | 0  | TO           | 13.6
b7  | if-else calculator       | 10 | 4  | if(3+4 == 3) then 44; else 73; | 1  | TO           | 21.7
b8  | Activation record layout | 10 | 3  | int a, b;                      | 0  | TO           | 13.8
b9  | Type checker             | 11 | 5  | (5 - 2) == 3                   | 1  | TO           | 15.4
b10 | Forward differentiation  | 20 | 12 | x*pow(x,3)                     | 2  | TO           | 39.2
b11 | Bytecode interpreter     | 39 | 36 | bipush 3; bipush 4; iadd;      | 3  | TO           | 141.4
b12 | Mini-compiler            | 43 | 6  | int main() { return … }        | 0  | TO           | 9.2

Fig. 7: Trace generation for AST construction.
(e) PROMELA source code
    init {
      run Foo(8+(6-7));
    }
(f) Trace generated
    node n1 = node(val=8);
    node n2 = node(val=6);
    node n3 = node(val=7);
    node n4 = (n2, n3);
    node n5 = (n1, n4);
    node n6 = (Foo, n5);

Fig. 8: Stacked bar graph for the % time spent in example creation and synthesis

2) Scaling with size of grammar: Table I shows that 
INCREMENTALSYNTHESIS scales well with the size of the 
grammar (by the number of productions). On the other hand, 
ALLATONCE works well for the benchmarks b1 and b2 as 
they have only one hole while it times out for the rest. 


The complexity of INCREMENTALSYNTHESIS is indepen- 
dent of the size of the attribute-grammar but dependent on the 
length of derivations and the size of the semantic actions. The 
current state of synthesis-technology allows PANINI to synthe- 
size practical attribute grammars that have a large number of 
productions but mostly “small” semantic actions and where 
short derivations can “cover” all productions. Further, any 
improvement in program-synthesis technology automatically 
improves the scalability of PANINI. 


VI. CASE STUDY 


We undertook a case-study on the parser specification of the 
SPIN [5] model-checker. SPIN is an industrial-strength model- 
checker that verifies models written in the PROMELA [26] 
language against linear temporal logic (LTL) specifications. 
SPIN uses YACC [2] to build its parser for PROMELA. 
The modelling language, PROMELA, is quite rich, supporting 
variable assignments, branches, loops, arrays, structures, pro- 
cedures etc. The attribute grammar specification in the YACC 
language is more than 1000 lines of code (ignoring newlines) 
having 280 production rules. 

The semantic actions within the attribute grammar in the 
YACC description handle multiple responsibilities. We selected 
two of its operations: 
Fig. 9: #Holes vs. time for benchmarks b10 and b11. (a) Forward differentiation (b10). (b) Java bytecode (b11).

(a) PROMELA source
    init {
      int flags[(5 * 25) - 42];
      int v = flags[10 - 4 + (9 / 3)];
    }

(b) PROMELA optimized
    init {
      int flags[83];
      int v = flags[9];
    }


a) Constant folding array indices: As the PROMELA 
code is parsed, semantic actions automatically constant-fold 
array indices (see Fig. 10). We removed all the actions 
corresponding to constant-folding by inserting 8 holes in the 
relevant production rules (these correspond to the non-terminal 
const_expr). The examples to drive the synthesis consisted 
of PROMELA code with arrays with complex expressions and 
the target output was the optimized PROMELA code. PANINI
was able to automatically synthesize this constant-folding
optimization in under 4 seconds.
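To make the target transformation concrete, the following Python sketch folds constant integer expressions inside array brackets, reproducing the input/output pair of Fig. 10. It only illustrates the behavior that the examples specify; it is not the synthesized semantic actions, and the helper names `fold` and `fold_indices` are invented here. One assumption to note: `/` is mapped to Python floor division, which matches C-style integer division only for non-negative operands.

```python
import ast
import operator
import re

# Supported binary operators; "/" is treated as integer division (assumption).
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.floordiv}


def fold(node):
    """Evaluate a constant integer expression given as a Python AST."""
    if isinstance(node, ast.Expression):
        return fold(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, int):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](fold(node.left), fold(node.right))
    raise ValueError("not a constant integer expression")


def fold_indices(src):
    """Replace every constant expression inside [...] by its value."""
    def repl(match):
        try:
            tree = ast.parse(match.group(1).strip(), mode="eval")
            return "[%d]" % fold(tree)
        except (ValueError, SyntaxError, ZeroDivisionError):
            return match.group(0)        # leave non-constant indices alone
    return re.sub(r"\[([^\[\]]+)\]", repl, src)


print(fold_indices("int flags[(5 * 25) - 42];"))
print(fold_indices("int v = flags[10 - 4 + (9 / 3)];"))
```

Indices that mention variables are left untouched, since only constant sub-expressions can be folded at parse time.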

b) AST construction: A primary responsibility of the semantic analysis phase is to construct the abstract syntax tree (AST) of the source PROMELA code. We next investigated whether PANINI is capable of this complex task.

In this case, each example includes PROMELA code as input and a tree (i.e., the AST) as the output value. We removed the existing actions via 23 holes. These holes had to synthesize the end-to-end functionality for a production rule with respect to building the AST: that is, the synthesized code decides the type of AST node to be created and the correct order of inserting the children sub-trees.

Running the example suite on the sketchy productions generates a set of programs (one such program is shown in Fig. 7); these programs produce symbolic ASTs that non-deterministically assign types to nodes and assign the children nodes. We leverage the support for references in SKETCH to define self-referential nodes.

We insert constraints that establish tree isomorphism by 
recursively matching the symbolic ASTs with the respective 
output ASTs (available in example suite); for example, in 
Fig. 11 isomorphism constraints are enforced on the concrete 
and the symbolic ASTs. Sketch resolves the non-determinism 
en route to synthesizing the relevant semantic actions. In 
this case-study, PANINI was able to synthesize the actions 
corresponding to the 23 holes within 20 seconds. 


VII. RELATED WORK 
Program synthesis is a rich area with proposals in varying domains: bitvectors [27], [28], heap manipulations [29]-[33], bug synthesis [34], differential privacy [35], [36], invariant generation [37], Skolem functions [38]-[40], synthesis of fences and atomic blocks [41] and even hardware security [42]. However, to the best of our knowledge, ours is the first work on automatically synthesizing semantic actions for attribute grammars.

Fig. 10: Constant folding in PROMELA. (a) PROMELA source. (b) PROMELA optimized.

Fig. 11: Desired AST for code in Fig. 7 and symbolic AST. (a) Desired concrete AST. (b) Symbolic AST. Grey lines (in Fig. 11b) denote symbolic choices.


There has been some work on automatically synthesizing parsers: PARSIFY [43] provides an interactive environment to automatically infer grammar rules to parse strings; it has been shown to synthesize grammars for Verilog, Tiger, Apache Logs, and SQL. CYCLOPS [44] builds an encoding for Parse Conditions, a formalism akin to Verification Conditions but for parseable languages. Given a set of positive and negative examples, CYCLOPS automatically generates an LL(1) grammar that accepts all positive examples and rejects all negative examples. Though neither of them handles attribute grammars, it may be possible to integrate them with PANINI to synthesize both the context-free grammar and the semantic actions. We plan to pursue this direction in the future.


We are not aware of much work on testing attribute 
grammars. We believe that our derivation coverage metric 
can also be potent for finding bugs in attribute grammars, 
and can have further applications in dynamic analysis [45]- 
[47] and statistical testing [48], [49] of grammars. However, 
the effectiveness of this metric for bug-hunting needs to be 
evaluated and seems to be a good direction for the future. 
Abstract—Reactive synthesis builds a system from a specification
given as a temporal logic formula. Traditionally, reactive
synthesis is defined for systems with Boolean input and output 
variables. Recently, new techniques have been proposed to extend 
reactive synthesis to data domains, which are required for 
more sophisticated programs. In particular, Temporal stream 
logic (TSL) extends LTL with state variables, updates, and 
uninterpreted functions and was created for use in synthesis. We 
present a new synthesis procedure for TSL(T), an extension of 
TSL with theories. Our approach is also able to find predicates, 
not present in the specification, that are required to synthesize 
some programs. Synthesis is performed using two nested counter- 
example guided synthesis loops and an LTL synthesis procedure. 
Our method translates TSL(T) specifications to LTL and extracts 
a system if synthesis is successful. Otherwise, it analyzes the
counterstrategy for inconsistencies with the theory; these are then
ruled out by adding temporal assumptions, and the next iteration
of the loop is started. If no inconsistencies are found, the outer
refinement loop tries to identify new predicates and reruns the
inner loop. A system can be extracted if the LTL synthesis returns
realizable at any point; if no more predicates can be added, the
problem is unrealizable. The general synthesis problem for TSL is
known to be undecidable. We identify a new decidable fragment 
and demonstrate that our method can successfully synthesize or 
show unrealizability of several non-Boolean examples. 


I. INTRODUCTION 


Reactive synthesis [1] is the problem of automatically 
constructing a system from a specification. The user provides 
a specification in temporal logic and the synthesis procedure 
constructs a system that satisfies it if one exists. Traditionally 
this only works for systems with Boolean input and output 
variables. However, real-world systems often use more so- 
phisticated data like integers, reals, or structured data. For 
finite domains, it is possible to use bit-blasting to obtain an 
equivalent Boolean specification. However, in general, bit- 
blasting techniques do not work for infinite domains, bit- 
blasted specifications are hard to read, and a large number 
of variables make the specifications very hard to solve. 

In recent years multiple theories have been proposed to 
perform reactive synthesis with non-Boolean inputs and out- 
puts. There have been decidability results for synthesis using 
register automata [2], [3], [4] and variable automata [5]. 

Our work builds on temporal stream logic (TSL). TSL, proposed by Finkbeiner et al. [6], uses a logic based on linear temporal logic (LTL) with state variables, uninterpreted functions and predicates, and update expressions. TSL allows for an elegant and efficient synthesis method that separates control from data. However, the ability to specify how data is handled is limited because functions and predicates remain uninterpreted. Finkbeiner et al. [7] describe an extension to TSL modulo theories, but consider only satisfiability and not synthesis.

Fig. 1. Synthesized system for the running example.

In this paper, we propose a new synthesis algorithm for 
temporal stream logic modulo theories that can be applied to 
arbitrary decidable theories in which quantifier elimination is 
possible. Let us consider a concrete example using the theory 
of linear integer arithmetic (LIA). 


Example 1. We want to build a system with one integer state variable $x$ and one integer input $i$. The objective is to keep the value of the state variable between 0 and 100. At any time step the system can select one of two updates: increase or decrease $x$ by $i$, where $i$ is chosen by the environment in the interval $0 \le i < 5$. We assume that the initial state is any value inside the boundaries. These requirements can be written as the TSL formula

$(0 \le x \wedge x < 100 \wedge \Box(0 \le i \wedge i < 5)) \rightarrow \Box(0 \le x \wedge x < 100 \wedge ([x \leftarrow x - i] \vee [x \leftarrow x + i]))$,

where the propositions $[x \leftarrow x - i]$ and $[x \leftarrow x + i]$ describe updates to $x$. Figure 1 shows a Mealy machine that realizes the specification above. It is impossible to write a correct system using only the predicates from the specification: if the environment initially chooses $x = 0$ the system has to first perform addition; however, if the environment chooses $x = 99$ the system has to perform subtraction. The predicates in the specification cannot distinguish these cases.
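The need for an extra predicate can be checked by brute force on this small example. The Python sketch below is an illustration only: the bounds ($0 \le x < 100$, $0 \le i < 5$) are assumptions about the specification, the guard $x + i < 100$ is one possible new predicate, and `controller` is not claimed to be the machine in Figure 1.

```python
LOW, HIGH, MAX_I = 0, 100, 5     # assumed bounds: LOW <= x < HIGH, 0 <= i < MAX_I


def controller(x, i):
    """Add when the result stays below HIGH, otherwise subtract.

    The guard x + i < HIGH is a predicate that does not occur in the spec.
    """
    return x + i if x + i < HIGH else x - i


def always_add(x, i):
    return x + i


def always_sub(x, i):
    return x - i


def is_safe(strategy):
    """Safety is inductive here: check one step from every admissible state."""
    return all(LOW <= strategy(x, i) < HIGH
               for x in range(LOW, HIGH) for i in range(MAX_I))


print(is_safe(controller), is_safe(always_add), is_safe(always_sub))
```

Under the assumptions of the specification all of its predicates hold in every reachable state, so a strategy that can only observe them must pick a fixed update; both fixed choices fail the check, while the strategy with the additional guard passes.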


Inspired by this example, we want our synthesis algorithm to function with expressions from theories, as well as identify new predicates where necessary. Figure 2 shows an overview of our approach. We use a different refinement loop than the original TSL synthesis approach [6]: our approach relies on checking local properties and is also able to create new predicates.

This article is licensed under a Creative Commons Attribution 4.0 International License

Fig. 2. Overview of the synthesis procedure. The diagram's labels are: extend, prop. encoding, unrealizable, realizable, Counter Strategy, concretize, Boolean System, Concrete System.

First, the TSL specification is encoded into an LTL formula 
that contains a Boolean variable for each theory predicate in 
the TSL formula. These variables are seen as inputs, which 
means that the environment determines their truth values. As a 
result, realizability of the LTL formula implies realizability of 
the TSL formula, but not vice versa, because the environment 
can choose values for the variables that are not consistent with 
the theory. The LTL formula is then given to a propositional
LTL synthesis tool [8], [9]. If the Boolean synthesis is successful,
we obtain a Boolean system that can be concretized
into a system that operates on the original value domain. 
If synthesis of the LTL formula is not successful, we get a 
Boolean counterstrategy that we analyze for inconsistencies 
with respect to the theory. 

The central part of our algorithm is the theory consistency 
analysis of the counterstrategy. In contrast to Finkbeiner et al. 
[6] we treat the counterstrategy as a Moore machine instead 
of a tree. The output of a state in a counterstrategy is a 
valuation of the predicates and the transitions perform updates 
on the register values. Whereas the original approach analyzes 
potentially long traces in the tree, we perform a local analysis 
of individual states and transitions. We use an SMT solver to 
check whether the predicate valuation in each state is consis- 
tent and whether the valuations in two consecutive states are 
consistent with the updates on the transition between them. We 
show that if all states and transitions are (locally) consistent, 
then the counterstrategy is globally theory-consistent as well 
and thus the TSL specification is unrealizable. 

If an inconsistency is found, the counterstrategy is spurious. 
We thus need to refine the LTL formula and start a new 
iteration. We refine the LTL specification by adding new 
assumptions and possibly new predicates. The assumptions 
refer to the relation between predicates (for inconsistent 
outputs of states) or between predicates and updates (for 
inconsistent transitions). If a transition is consistent for some 
values but inconsistent for others, we create a new predicate 
that distinguishes whether the update is valid or not. 

The procedure can be likened to a CEGAR loop [10] or to DPLL(T), in which LTL synthesis plays the role of the propositional SAT solver and the consistency check is performed by the theory solver. The main difference is that in our case inconsistencies can span multiple time steps. Our approach has multiple advantages over the original TSL synthesis approach: it can easily be extended to new theories, it can find new predicates needed to realize certain specifications, and it can show unrealizability (without a bound on the size of the considered systems).
The main contributions of our paper are as follows: 


• A new synthesis procedure for TSL that works with
theories and can generate additional predicates that are
necessary to realize certain specifications.

• Although synthesis for TSL is undecidable in general [6],
we show that the problem is decidable for
equality logic.

• Our algorithm can prove that certain specifications are
unrealizable.


The remainder of the paper is structured as follows: Section II
summarizes required definitions from TSL. Section III for- 
malizes the synthesis problem for TSL modulo theory. We 
describe the Boolean abstraction and the theory consistency 
analysis in detail in Section IV. The main synthesis procedure 
is described in Section V. An experimental evaluation was 
performed for multiple examples using the theories of linear 
integer arithmetic and linear real arithmetic (Section VI). We 
discuss related work in Section VII and conclude our work in 
Section VIII. 


II. PRELIMINARIES 


We use Temporal Stream Logic (TSL) [6], with the addition 
of decidable theories [7]. This section repeats the definitions 
from Finkbeiner et al. [6], [7] with some changes in notation 
and with a more general treatment of theories. 


A. Theories and Updates 


In contrast to [7], which uses an axiomatic semantics for its
theories, we rely on the definitions built into an
SMT solver. A theory $\mathcal{T}$ consists of a signature (symbols for
constants, functions, and predicates) and semantics as defined
by the SMT solver (which can be axiomatic). The domain
is called $\mathbb{T}$. In the case of variables of different sorts (i.e., types),
we use $\mathbb{T}$ for the union of all domains and assume that all
variables take values from their domain. In the following,
we will use $E_{\mathcal{T}}(V)$ to denote the set of expressions in $\mathcal{T}$
over a set of variables $V$ that is a superset of the free
variables in the expression. The set $E_{\mathcal{T}}(V)$ is partitioned
into a set $E^{\mathbb{T}}_{\mathcal{T}}(V)$ of terms (denoting values in $\mathbb{T}$) and a set
$E^{\mathbb{B}}_{\mathcal{T}}(V)$ of formulas (denoting truth values). We assume that
the theories used have decision procedures for satisfiability
checking and quantifier elimination. We assume that we are
given a procedure sat that returns true iff a formula $\varphi$ is
satisfiable and a function quantelim that takes a formula
and returns a theory-equivalent formula that does not contain
quantifiers.

B. Temporal Stream Logic Modulo Theories


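TSL(T) is based on linear temporal logic, but instead of
Boolean variables it uses updates and Boolean theory expressions. Key concepts of TSL and TSL(T) are state variables
$R$ and input variables $I$, which both hold values from the
theory domain, and updates, which read state variables and
input variables and write new values to the state variables.
The grammar for TSL(T) formulas is

$$
\begin{aligned}
\langle ap \rangle &:= E^{\mathbb{B}}_{\mathcal{T}}(R \cup I) \\
\langle term \rangle &:= E^{\mathbb{T}}_{\mathcal{T}}(R \cup I) \\
\langle bconst \rangle &:= \mathit{true} \mid \mathit{false} \\
\langle upd \rangle &:= [\langle var \rangle \leftarrow \langle term \rangle] \\
\langle tsl \rangle &:= \langle ap \rangle \mid \langle upd \rangle \mid \langle bconst \rangle \mid \neg \langle tsl \rangle \mid \langle tsl \rangle \wedge \langle tsl \rangle \mid \langle tsl \rangle \mathbin{\mathsf{U}} \langle tsl \rangle \mid \mathsf{X}\, \langle tsl \rangle.
\end{aligned}
$$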


To define the semantics of TSL(T) we need some additional
notation that will be used throughout the paper. An update
$[r \leftarrow e]$ assigns to a state variable $r$ an expression $e \in E^{\mathbb{T}}_{\mathcal{T}}(R \cup I)$.
We use $U$ for the finite set of all updates that occur in the
formula under consideration together with the updates that assign
each state variable to itself. An update function $u$ is a function
that associates each state variable $r \in R$ with an expression
$e \in E^{\mathbb{T}}_{\mathcal{T}}(R \cup I)$. We write $\mathcal{U}$ for the set of update functions in which
all pairs $(r, e)$ are updates in $U$. Equivalently, $\mathcal{U}$ can be
seen as the set of all subsets of $U$ that contain exactly one
update for each state variable in $R$.
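
The relation between the set of updates and the set of update functions can be illustrated with a small sketch. It assumes that updates are given as (state variable, expression) pairs with expressions as plain strings; the helper name `update_functions` and the sample update set are hypothetical and not part of the paper.

```python
from itertools import product

def update_functions(updates):
    """Enumerate all update functions over a finite set of updates.

    `updates` is an iterable of (state_variable, expression) pairs.
    Each result picks exactly one update per state variable.
    """
    by_var = {}
    for var, expr in updates:
        by_var.setdefault(var, []).append(expr)
    variables = sorted(by_var)
    return [dict(zip(variables, choice))
            for choice in product(*(by_var[v] for v in variables))]

# Hypothetical update set: three updates for x (including the
# self-update) and two updates for y.
U = [("x", "x"), ("x", "x + i"), ("x", "x - i"), ("y", "y"), ("y", "0")]
fs = update_functions(U)
print(len(fs))  # 3 * 2 = 6 update functions
```

The number of update functions is the product of the number of updates per state variable, which is why the propositional encoding works with individual updates rather than with update functions.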

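We introduce the notations $\mathcal{R} \triangleq R \to \mathbb{T}$ and $\mathcal{I} \triangleq I \to \mathbb{T}$ for
the sets of valuations of the state and input variables. We write $R/\mathbf{r}$ ($I/\mathbf{i}$) to
denote the replacement of all variables in $R$ ($I$, resp.) by their
corresponding values in $\mathbf{r} \in \mathcal{R}$ ($\mathbf{i} \in \mathcal{I}$, resp.). With slight abuse
of notation, we identify $e[R/\mathbf{r}, I/\mathbf{i}]$ with the corresponding
value in the domain. To apply an update function $u \in \mathcal{U}$ to
valuations $\mathbf{r} \in \mathcal{R}$ and $\mathbf{i} \in \mathcal{I}$ we write $u[\mathbf{r}, \mathbf{i}]$, which is defined
as $u[\mathbf{r}, \mathbf{i}](r) = u(r)[R/\mathbf{r}, I/\mathbf{i}]$ for each $r \in R$.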

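The semantics of TSL(T) is defined with respect to a trace
$\rho \in (\mathcal{R} \times \mathcal{I})^{\omega}$ of state variable valuations and inputs as follows.
We assume that $\rho = \rho_0, \rho_1, \ldots$ and that $\rho_j = (\mathbf{r}_j, \mathbf{i}_j)$, and we
define

$$
\begin{aligned}
\rho \models p &\iff \rho_0 \models p \quad \text{for } p \in \langle ap \rangle, \\
\rho \models [r \leftarrow e] &\iff \mathbf{r}_1(r) = e[R/\mathbf{r}_0, I/\mathbf{i}_0], \\
\rho &\models \mathit{true}, \\
\rho &\not\models \mathit{false}, \\
\rho \models \neg\varphi &\iff \rho \not\models \varphi, \\
\rho \models \varphi \wedge \psi &\iff \rho \models \varphi \text{ and } \rho \models \psi, \\
\rho \models \varphi \mathbin{\mathsf{U}} \psi &\iff \exists j.\; \rho_j, \rho_{j+1}, \ldots \models \psi \text{ and } \forall i < j.\; \rho_i, \rho_{i+1}, \ldots \models \varphi, \\
\rho \models \mathsf{X}\, \varphi &\iff \rho_1, \rho_2, \ldots \models \varphi.
\end{aligned}
$$

The unary temporal operators eventually ($\Diamond$) and globally ($\Box$)
can be added using their usual definitions: $\Diamond \varphi = \mathit{true} \mathbin{\mathsf{U}} \varphi$
and $\Box \varphi = \neg \Diamond \neg \varphi$.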


C. LTL Synthesis 


Our algorithm relies on existing solvers for the linear 
temporal logic (LTL) synthesis problem which we also refer 
to as propositional synthesis. For completeness, we provide 
a brief description of the problem. A formal treatment is 
available in [1]. 

Consider an LTL formula $\varphi$ over Boolean variables $X$,
partitioned into two disjoint sets $X_I$ and $X_O$, and a
game between two players (environment and system) where at
each point in time both players pick the values of their Boolean
variables (first $X_I$ by the environment, followed by $X_O$ by the
system). The game is won by the system if the resulting infinite
trace satisfies $\varphi$ and is won by the environment otherwise.
The synthesis problem is: does there exist a Mealy machine
strategy for the system that wins against every environment? If
no such strategy exists, there exists a Moore machine strategy
for the environment that wins against every system. An LTL
synthesis tool such as Strix [8] can determine who wins and
construct a (Mealy or Moore) strategy for the winning player.


III. SYNTHESIS PROBLEM FOR TSL(T)


We want to synthesize systems from a TSL(T) specification, 
which we defined in the previous section. Before giving a 
formal definition of synthesis we need to define the systems 
we want to build. 

Our constructed systems differ from those considered by
Finkbeiner et al. [6]. They synthesize control flow models,
which consist of a circuit of logic gates and vertices of
uninterpreted functions that determine the values of outputs
and new cells based on inputs and old cells. We instead target
an extension of Mealy machines.


A. Theory Mealy and Moore Machines 


A system using state variables of an unbounded domain
can be hard to represent finitely. To create actual programs,
our systems need to have a finite structure. This is achieved by
restricting all operations on state and input variables to a finite
set of symbolic operations. The values of state variables need
to be determined by an update chosen from a finite set $U$. The
set of all update functions using updates from $U$ is denoted
by $\mathcal{U}$. To make decisions based on the values of variables we
use a finite set of predicates $P \subseteq E^{\mathbb{B}}_{\mathcal{T}}(R \cup I)$. For a given
valuation $v = (\mathbf{r}, \mathbf{i})$, let $P_v \subseteq P$ be the subset of predicates
that are true in $v$: $P_v = \{p \in P \mid v \models p\}$. Using these we
can define an extended trace as an infinite sequence $\rho_E =
(\mathbf{r}_0, \mathbf{i}_0, u_0, P_0), \ldots$ over $(\mathcal{R} \times \mathcal{I} \times \mathcal{U} \times 2^P)$ such that for all $j$,
$\mathbf{r}_{j+1} = u_j[\mathbf{r}_j, \mathbf{i}_j]$ and $P_j = P_{(\mathbf{r}_j, \mathbf{i}_j)}$. The corresponding theory
trace is the trace $(\mathbf{r}_0, \mathbf{i}_0), \ldots$ over $\mathcal{R} \times \mathcal{I}$.

We introduce the new concept of Theory Mealy Machines,
which are state machines with inputs and state variables that
range over the theory domain. A Theory Mealy Machine
$M_{\mathcal{T}} = (Q, q_0, U, P, \mathbf{r}_0, \delta, \mu)$ consists of a finite set of states
$Q$, an initial state $q_0 \in Q$, a finite set of updates $U$, a finite set
of predicates $P \subseteq E^{\mathbb{B}}_{\mathcal{T}}(R \cup I)$, an initial valuation $\mathbf{r}_0 \in \mathcal{R}$, a
transition function $\delta : Q \times 2^P \to Q$, and an update selection
function $\mu : Q \times 2^P \to \mathcal{U}$.
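
A run $\sigma$ of a Theory Mealy Machine induced by a sequence
of input valuations $\mathbf{i} = \mathbf{i}_0, \mathbf{i}_1, \ldots \in \mathcal{I}^{\omega}$ is an infinite sequence
$(q_0, \mathbf{r}_0), (q_1, \mathbf{r}_1), \ldots$ of states from $Q$ and valuations from $\mathcal{R}$. Any two
consecutive configurations $(q_j, \mathbf{r}_j)$ and $(q_{j+1}, \mathbf{r}_{j+1})$ must be
related by $q_{j+1} = \delta(q_j, P_{(\mathbf{r}_j, \mathbf{i}_j)})$ and $\mathbf{r}_{j+1} = u_j[\mathbf{r}_j, \mathbf{i}_j]$, where
$u_j = \mu(q_j, P_{(\mathbf{r}_j, \mathbf{i}_j)})$.

An extended trace is obtained from a run as the infinite
sequence $(\mathbf{r}_0, \mathbf{i}_0, u_0, P_{(\mathbf{r}_0, \mathbf{i}_0)}), \ldots$. A Theory Mealy Machine
$M_{\mathcal{T}}$ realizes a TSL(T) formula $\varphi$ if for all input sequences
$\mathbf{i} \in \mathcal{I}^{\omega}$ the resulting theory trace $\rho = (\mathbf{r}, \mathbf{i})$ satisfies $\varphi$.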
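
To make the run semantics concrete, the following is a minimal sketch of a simulator for a finite prefix of a run. The dictionary-based representation, the function name `run_mealy`, and the one-state example machine are assumptions made for illustration only; they are not taken from the paper or its tool.

```python
def run_mealy(delta, mu, predicates, q0, r0, inputs):
    """Simulate a Theory Mealy Machine on a finite input prefix.

    delta:      dict (state, frozenset of true predicate names) -> state
    mu:         dict (state, frozenset of true predicate names) -> update
                function, given as dict state_var -> callable(r, i)
    predicates: dict name -> callable(r, i) returning bool
    """
    q, r = q0, dict(r0)
    trace = []
    for i in inputs:
        # Evaluate all predicates on the current valuation (r, i).
        true_preds = frozenset(n for n, p in predicates.items() if p(r, i))
        u = mu[(q, true_preds)]
        trace.append((dict(r), dict(i), true_preds))
        # All updates read the old valuation (simultaneous assignment).
        r = {var: f(r, i) for var, f in u.items()}
        q = delta[(q, true_preds)]
    return trace, (q, r)

# Hypothetical one-state machine: add i while x < 5, otherwise subtract i.
preds = {"x<5": lambda r, i: r["x"] < 5}
low, high = frozenset({"x<5"}), frozenset()
delta = {("q0", low): "q0", ("q0", high): "q0"}
mu = {("q0", low): {"x": lambda r, i: r["x"] + i["i"]},
      ("q0", high): {"x": lambda r, i: r["x"] - i["i"]}}
trace, final = run_mealy(delta, mu, preds, "q0", {"x": 0}, [{"i": 2}] * 4)
print([step[0]["x"] for step in trace], final)
```

The recorded tuples correspond to a prefix of the extended trace without the update component; adding `u` to each tuple would give the full extended trace.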

We also define Theory Moore Machines, which read the
updates produced by a Mealy machine and produce the inputs
read by a Mealy machine. Intuitively, Mealy machines are used
to show realizability of a TSL(T) specification, while Moore
machines are used to show its unrealizability. A Theory
Moore Machine $M_{\mathcal{T}} = (Q, q_0, U, P, \mathbf{r}_0, \delta, \iota)$ consists of a finite
set of states $Q$, an initial state $q_0 \in Q$, a finite set of updates $U$,
a finite set of predicates $P \subseteq E^{\mathbb{B}}_{\mathcal{T}}(R \cup I)$, an initial valuation
$\mathbf{r}_0 \in \mathcal{R}$, a transition function $\delta : Q \times \mathcal{U} \to Q$, and
an output function $\iota : Q \times \mathcal{R} \to \mathcal{I}$.

A run $\sigma$ of a Theory Moore Machine induced by an
infinite sequence of update functions $\mathbf{u} = u_0, u_1, \ldots \in \mathcal{U}^{\omega}$ is a sequence
$(q_0, \mathbf{r}_0), (q_1, \mathbf{r}_1), \ldots$ of states and valuations. Any two consecutive
entries $(q_j, \mathbf{r}_j)$ and $(q_{j+1}, \mathbf{r}_{j+1})$ must be related by
$q_{j+1} = \delta(q_j, u_j)$ and $\mathbf{r}_{j+1} = u_j[\mathbf{r}_j, \iota(q_j, \mathbf{r}_j)]$.

B. Problem Statement 


Given a TSL(T) formula $\varphi$, the inputs $I$, the state variables
$R$, and the updates $U$, the synthesis problem asks whether
there exists a Theory Mealy Machine $M_{\mathcal{T}}$ over $R$, $I$, and $U$
such that for all input sequences $\mathbf{i}$ the trace generated by $M_{\mathcal{T}}$
satisfies $\varphi$. Note that the constructed machine $M_{\mathcal{T}}$ must use the
same variables $I$ and $R$ as well as the same updates $U$ as $\varphi$, but it
may use predicates $P$ that are not present in $\varphi$.


IV. BOOLEAN ABSTRACTION 
A. Propositional Encoding of TSL(T) 


This subsection describes the propositional encoding of 
TSL(T) into LTL as proposed by Finkbeiner et al. [6]. The 
fact that the functions and predicates in our terms have 
an interpretation does not affect this translation and it is 
equivalent to the one for TSL. 

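A TSL(T) formula $\varphi$ is encoded as an LTL formula
$\varphi_B$. Formula $\varphi_B$ is obtained by replacing each update $u$
in $\varphi$ by a Boolean output variable $p_u$, and each atomic
proposition $ap$ by a Boolean input variable $p_{ap}$. Additionally,
the formula ensures that for each state variable exactly
one update is active at any point in time. This
results in

$$
\varphi_B = \Box\Bigl(\bigwedge_{r \in R} \bigvee_{u \in U_r} \bigl(p_u \wedge \bigwedge_{u' \in U_r \setminus \{u\}} \neg p_{u'}\bigr)\Bigr) \wedge \varphi[ap/p_{ap}, \ldots, u/p_u, \ldots],
$$

where $U_r \subseteq U$ denotes the updates that assign to state variable $r$.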

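Example 2. The TSL(T) formula from Example 1 is encoded
as the LTL formula

$$
\begin{aligned}
\varphi_B = {} & \Box\bigl(p_{[x \leftarrow x-i]} \wedge \neg p_{[x \leftarrow x+i]} \vee p_{[x \leftarrow x+i]} \wedge \neg p_{[x \leftarrow x-i]}\bigr) \wedge {} \\
& \bigl((p_{0 \le x} \wedge p_{x<100} \wedge \Box(p_{0 \le i} \wedge p_{i<5})) \rightarrow {} \\
& \quad \Box(p_{0 \le x} \wedge p_{x<100} \wedge (p_{[x \leftarrow x-i]} \vee p_{[x \leftarrow x+i]}))\bigr),
\end{aligned}
$$

where $p_{0 \le x}$, $p_{x<100}$, $p_{0 \le i}$, and $p_{i<5}$ are input variables and
$p_{[x \leftarrow x-i]}$ and $p_{[x \leftarrow x+i]}$ are output variables.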
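
The "exactly one update per state variable" part of the encoding is mechanical and can be sketched as a small formula generator. The textual LTL syntax (`G`, `&`, `|`, `!`) and the proposition naming scheme `p_[x<-e]` are assumptions chosen for readability; they are not the input format of any particular synthesis tool.

```python
def exactly_one_update(updates):
    """Build the constraint that every state variable has exactly one
    active update in every step.

    updates: dict state_var -> list of update expressions (strings).
    """
    def prop(var, expr):
        return f"p_[{var}<-{expr}]"

    per_var = []
    for var, exprs in updates.items():
        options = []
        for e in exprs:
            # Update e is active and all other updates of var are not.
            others = [f"!{prop(var, o)}" for o in exprs if o != e]
            options.append("(" + " & ".join([prop(var, e)] + others) + ")")
        per_var.append("(" + " | ".join(options) + ")")
    return "G(" + " & ".join(per_var) + ")"

constraint = exactly_one_update({"x": ["x-i", "x+i"]})
print(constraint)
```

For the two updates of the running example this yields the same shape as the first conjunct of the encoded formula in Example 2.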


B. Boolean Mealy and Moore Machines 


Given a set of Boolean variables $V = \{p_u \mid u \in U\} \cup \{p_{ap} \mid
ap \in P\}$, we say that a Boolean trace $\rho_B = v_0, v_1, \ldots$ over $2^V$
corresponds to an extended trace $\rho_E$ iff for all $j$, $p_{ap} \in \rho_B(j)$
iff $\mathbf{r}_j \cup \mathbf{i}_j \models ap$, and $p_u \in \rho_B(j)$ iff $u \in u_j$. Clearly, every
extended trace corresponds to a Boolean trace, but the converse
does not hold, for instance, because two predicates contradict each
other, or because the updates and the predicates do not match.

The LTL specifications obtained from the propositional
encoding can be realized by standard Mealy machines or
shown unrealizable by standard Moore machines. To make the
meaning of the input and output variables clearer, we will call
them predicates $P$ and updates $U$ and refer to the machines as
Boolean Mealy and Moore machines. We use $\mathcal{U}$ and $2^P$ and
leave the translation into vectors of Boolean variables implicit.

A Boolean Mealy machine is a tuple $(Q, P, U, q_0, \delta_B, \mu_B)$,
where $Q$ is a set of states, $U$ is a set of updates, $P$ is a set of
predicates, $q_0 \in Q$ is the initial state, $\delta_B : Q \times 2^P \to Q$ is
the transition function, and $\mu_B : Q \times 2^P \to \mathcal{U}$ is the update
selection function.

A run $\sigma_B$ of a Boolean Mealy machine induced by
a sequence of predicate sets $\mathbf{P} = P_0, P_1, \ldots$ is an
infinite sequence of states, updates, and predicate sets
$(q_0, u_0, P_0), (q_1, u_1, P_1), \ldots$ where $q_{i+1} = \delta_B(q_i, P_i)$ and
$u_{i+1} = \mu_B(q_i, P_i)$. The corresponding Boolean trace $\rho_B$ is
$(u_0, P_0), (u_1, P_1), \ldots$.

A Boolean Mealy machine $M_B$ is theory consistent with
respect to theory $\mathcal{T}$ iff every trace $\rho_B$ induced by a consistent
sequence $\mathbf{P}$ has a corresponding extended trace $\rho_E$.

A Boolean Moore machine is a tuple $M_B =
(Q, P, U, q_0, \delta, o)$ where $Q$ is a set of states, $P$ is a set
of predicates, $U$ is a set of updates, $q_0 \in Q$ is the initial state, $\delta : Q \times \mathcal{U} \to Q$
is the transition function, and $o : Q \to 2^P$ is the output
function.

A run $\sigma_B$ of a Boolean Moore machine induced by a
sequence $\mathbf{u} = u_0, u_1, \ldots$ is an infinite sequence of states and
predicate sets $(q_0, P_0), (q_1, P_1), \ldots$ where $q_{i+1} = \delta(q_i, u_i)$
and $P_i = o(q_i)$. We call a Boolean Moore machine theory
consistent with respect to theory $\mathcal{T}$ iff every trace $\rho_B =
P_0, P_1, \ldots$ can be extended to an extended trace $\rho_E$.


C. Theory Consistency Analysis 


We propose three criteria to locally analyze Boolean Moore
machines $M_B = (Q, P, U, q_0, \delta, o)$ with theory input
variables $I$ for theory consistency.


• Every state must be inhabited by at least one concrete
state, i.e., $\mathit{sat}(o(q))$ for every $q \in Q$.

• Every transition must be valid for at least one pair of
concrete pre- and post-states, i.e., $\mathit{sat}(o(q_i) \wedge u \wedge o(q_j)')$
for every $(q_i, u, q_j) \in \delta$.

• Every transition must be valid for all concrete pre-states,
i.e., $o(q_i) \rightarrow wp(u, \exists i'.\, o(q_j)')$ for every $(q_i, u, q_j) \in \delta$.


Fig. 3. Theory consistent machine that is not locally consistent. 


Lemma 1. If these local consistency criteria are satisfied for a
Boolean Moore machine $M_B$, it is (globally) theory consistent
and there exists a Theory Moore Machine $M_{\mathcal{T}}$ whose extended
traces are consistent with the traces of $M_B$.


Proof. Assume that all criteria are satisfied. The output of every
state is satisfiable; this includes the initial state, whose models
provide an initial value $\mathbf{r}_0$ for a theory machine. For every
transition $(q_i, u, q_j)$, every model of $o(q_i)$
is mapped to a model of $o(q_j)$ by the transition
property checked by the algorithm. Therefore, by induction,
all paths starting in the initial state and only using transitions
from $\delta$ have a corresponding extended trace $\rho_E =
(\mathbf{r}_0, \mathbf{i}_0, u_0, P_0), (\mathbf{r}_1, \mathbf{i}_1, u_1, P_1), \ldots$ where $\mathbf{r}_{j+1} = u_j[\mathbf{r}_j, \mathbf{i}_j]$
and $\mathbf{i}_j$ is chosen such that $\mathbf{r}_j \cup \mathbf{i}_j \models P_j$. A Theory Moore Machine
can be obtained by providing an output function that chooses the
values $\mathbf{i}_j$ based on $q_j$ and $\mathbf{r}_j$.


Example 3. Our approach checks consistency on a local
level; the environment strategy can still be theory consistent
if the third criterion is violated. The Boolean Moore machine
in Figure 3 has two transitions (orange) that do not satisfy
this criterion even though the machine is theory consistent.
The blue annotations are not part of the machine but are used
to argue its (global) consistency. The transition $(q_1, [x \leftarrow
x + 1], q_2)$ would be invalid for $x = 1$ in $q_1$, but in every
execution $x = 0$ holds in $q_1$ and the problem does not appear. A
similar situation occurs for the transition $(q_2, [x \leftarrow x + 1], q_0)$,
where $x$ will always be 1 and the transition is only invalid for
$x < 1$. The blue annotations show the possible values of $x$ in
every state, demonstrating that all transitions are consistent.


We propose Algorithm 1 to locally analyze Boolean Moore
machines for theory consistency, based on the criteria above.
To be usable in our synthesis refinement loop, the algorithm
also creates additional assumptions and predicates that block
inconsistent counterstrategies. The counterstrategy analysis is
performed in three stages. The first checks the consistency of
the outputs in a single state; the second and third check the consistency
of transitions. The third check also creates new predicates.
We will use some shorthand notation to define formulas: the
output function $o(q)$ will be used to refer to the expression
consisting of the conjunction of the elements in $P$, each negated
if its corresponding Boolean variable is false. We use $o(q)'$
with the same meaning as $o(q)$, except that all free variables are
renamed to their primed version. Similarly, $u$ is to be read as
the conjunction of $r' = e$ for each update $[r \leftarrow e]$ in $u$.

def isconsistent(m):
    Data: Boolean Moore machine m = (Q, P, U, q0, δ, o), set of variables I
    Result: ⊤, or ⊥ (possibly) with additional assumptions and predicates.

    foreach q ∈ reachable(Q) do                          // Case 1
        if ¬sat(o(q)) then
            yield ⊥, □¬o(q);
    foreach (qi, u, qj) ∈ reachable(δ) do                // Case 2
        if ¬sat(o(qi) ∧ u ∧ o(qj)′) then
            yield ⊥, □(o(qi) ∧ u → X ¬o(qj));
    foreach (qi, u, qj) ∈ reachable(δ) do                // Case 3
        wp := weakest precondition(u, ∃i′. o(qj)′);
        wp := quantelim(wp);
        if sat(o(qi) ∧ ¬wp) then
            yield ⊥, □(¬wp ∧ u → X ¬o(qj));
    return ⊤;

Algorithm 1: Check Theory Consistency.


a) State Consistency: To check state consistency we look
at every (reachable) state in the counterstrategy and use an
SMT solver to check if the output assignment is consistent
with the theory. For example, the two variables $p_{x>5}$ and $p_{x<0}$
cannot be true in the same state. If such a problem is found,
we generate a new assumption that rules out this assignment
in every state. In the previous example, this would generate
the assumption $\Box(\neg p_{x>5} \vee \neg p_{x<0})$.

b) Transition Consistency: Once all states produce consistent
outputs and there still exists a counterstrategy, we turn
towards transitions. As of now, there are no assumptions that
link the state before an update was performed to the state
afterward. This step checks if there are impossible transitions.
We again use an SMT solver to perform this analysis. Consider
the transition $\{p_{x>5}\}\,[x \leftarrow x + 1]\,\{\neg p_{x>5}\}$. To check
it, the following SMT problem is generated: $x > 5 \wedge x' =
x + 1 \wedge \neg(x' > 5)$. This is unsatisfiable, and we can generate an
assumption to eliminate it: $\Box(p_{x>5} \wedge [x \leftarrow x + 1] \rightarrow \mathsf{X}\, p_{x>5})$.
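
The state and transition checks can be tried out with a small sketch. It deliberately replaces the SMT solver by an exhaustive search over a small integer box, which is only a bounded stand-in: a missing model inside the box does not prove unsatisfiability in general. The function `sat` and the variable name `x1` (standing for the primed variable) are assumptions for illustration.

```python
from itertools import product

def sat(formula, variables, domain=range(-20, 21)):
    """Bounded stand-in for an SMT query: return a model found in a
    small integer box, or None if the box contains no model."""
    for values in product(domain, repeat=len(variables)):
        env = dict(zip(variables, values))
        if formula(env):
            return env
    return None

# State check: p_{x>5} and p_{x<0} are both claimed to hold.
state = lambda e: e["x"] > 5 and e["x"] < 0

# Transition check: {p_{x>5}} [x <- x+1] {not p_{x>5}}.
blocked = lambda e: e["x"] > 5 and e["x1"] == e["x"] + 1 and not e["x1"] > 5

# A transition that is possible for some values: {x<=0} [x <- x+1] {x>0}.
possible = lambda e: e["x"] <= 0 and e["x1"] == e["x"] + 1 and e["x1"] > 0

print(sat(state, ["x"]), sat(blocked, ["x", "x1"]), sat(possible, ["x", "x1"]))
```

In an actual implementation each of these queries would be a call to an SMT solver over unbounded integers or reals.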

c) New Predicates: Another case is that a transition is
possible for some, but not all, of the values. For instance, the
triple $\{p_{x \le 0}\}\,[x \leftarrow x + 1]\,\{p_{x>0}\}$ does not hold for all values of
$x$. This shows that our current abstraction might not be precise
enough to correctly describe this transition and that we need an
additional predicate. We calculate the weakest precondition of
the post-state given the updates of the transition. This gives
us the predicate $x > -1$. Concrete states of the pre-state that are not
included in the weakest precondition cannot take the
transition. The weakest precondition can also be used as the
new predicate to distinguish which concrete states may take
a transition. In case there are input variables, the future inputs
are existentially quantified in the post-condition. This results
in a natural extension of the weakest precondition: it contains
all states that can reach one of the valid post-conditions.
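
For updates without inputs, the weakest precondition is obtained by substituting the update expressions into the post-condition. The sketch below models this substitution semantically with Python callables; the names `wp`, `u`, and `post` are hypothetical, and the existential quantification over future inputs described above is not covered.

```python
def wp(update, post):
    """Weakest precondition of `post` under a simultaneous update.

    update: dict state_var -> callable(env) computing the new value
            from the old valuation
    post:   callable(env) -> bool, evaluated on the post-state
    """
    def pre(env):
        new_env = dict(env)
        new_env.update({v: f(env) for v, f in update.items()})
        return post(new_env)
    return pre

u = {"x": lambda e: e["x"] + 1}        # the update [x <- x + 1]
post = lambda e: e["x"] > 0            # post-state predicate x > 0
p = wp(u, post)                        # semantically x + 1 > 0, i.e. x > -1

# Pre-states with x <= 0 that cannot take the transition:
excluded = [x for x in range(-5, 6) if x <= 0 and not p({"x": x})]
print(excluded)
```

The pre-state predicate $x \le 0$ is thus split by the new predicate: only $x = 0$ can take the transition, all smaller values cannot.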


Lemma 2. If for a Boolean Moore machine $M_B$ Algorithm 1
returns inconsistent together with assumptions $\psi$, then every Theory
Moore Machine $M_{\mathcal{T}}$ (where $M_B$ and $M_{\mathcal{T}}$ share the inputs,
state variables, and updates) satisfies $\psi$.

Proof. Algorithm 1 can produce three different types of assumptions
$\psi_s$, $\psi_t$, $\psi_p$ corresponding to the three cases of the
algorithm. Let $M_{\mathcal{T}}$ be an arbitrary Theory Moore Machine.
Let $\psi_s$ be $\Box \neg o(q)$ for an unsatisfiable $o(q)$. $M_{\mathcal{T}}$ must define
an output valuation for every state; because the set of models of $o(q)$ is empty, no
state in any $M_{\mathcal{T}}$ can produce such an output. Therefore all
$M_{\mathcal{T}}$ satisfy $\psi_s$.

Let $\psi_t$ be $\Box(o(q_i) \wedge u \rightarrow \mathsf{X}\, \neg o(q_j))$ where $\neg\mathit{sat}(o(q_i) \wedge
u \wedge o(q_j)')$ and $\mathit{sat}(o(q_i))$. None of the values satisfying
$o(q_i)$ have a successor in $o(q_j)$ after performing $u$. The added
constraint is equivalent to $\neg(o(q_i) \wedge u \wedge o(q_j)') \leftrightarrow \neg o(q_i) \vee \neg u \vee
\neg o(q_j)' \leftrightarrow (o(q_i) \wedge u) \rightarrow \neg o(q_j)' \leftrightarrow o(q_i) \wedge u \rightarrow \mathsf{X}\, \neg o(q_j)$.
All transitions in all $M_{\mathcal{T}}$ satisfy this property at all points in
time.

Let $\psi_p$ be $\Box(\neg p \wedge u \rightarrow \mathsf{X}\, \neg o(q_j))$ where $p$ is the weakest
precondition of $o(q_j)$ under $u$ and $\mathit{sat}(o(q_i) \wedge u \wedge \neg o(q_j)')$. By
the definition of the weakest precondition, no value in $\neg p$ leads
to $o(q_j)$ when performing $u$. This also holds in the presence
of inputs. The quantifier elimination procedure leads to the
weakest precondition for unknown inputs at the next time step.
All transitions in all $M_{\mathcal{T}}$ will lead from $\neg p$ to $\neg o(q_j)$ when
performing $u$.

All added constraints are satisfied by all states and transitions
in all $M_{\mathcal{T}}$. The constraints only talk about individual
states and transitions; therefore all traces of $M_{\mathcal{T}}$ also satisfy
these constraints and $\forall M_{\mathcal{T}}.\ M_{\mathcal{T}} \models \psi$.


D. Generalizing Counterexamples 


The counterexamples generated by Algorithm 1 only block 
the exact state or transition present in the counterstrategy. To 
achieve faster and better convergence it is necessary to general- 
ize these counterexamples. Generalization of counterexamples 
is done using an algorithm to find an unsatisfiable core, i.e., 
a small (not necessarily minimal) subset of clauses such that 
their conjunction is unsatisfiable. 

This is done for each assumption returned from Algorithm 1
as follows. For $\Box \neg o$ we compute $o_{usc} = \mathit{unsatcore}(o)$ and
produce the generalized assumption $\Box \neg o_{usc}$. For $\Box(o_1 \wedge u \rightarrow
\mathsf{X}\, \neg o_2)$ we compute $o_{1,usc}, u_{usc}, o_{2,usc}' = \mathit{unsatcore}(o_1 \wedge u \wedge
o_2')$ by keeping track of where each conjunct originated. This
is then turned back into the generalized assumption $\Box(o_{1,usc} \wedge
u_{usc} \rightarrow \mathsf{X}\, \neg o_{2,usc})$.

Using unsat cores in this way allows us to find smaller 
counterexamples that do not depend on superficial information. 
Therefore, the counterexamples also block situations where 
unrelated predicates or updates are different. 
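
One simple way to obtain an unsatisfiable core is deletion-based minimization: drop one conjunct at a time and keep it out if the remainder is still unsatisfiable. The sketch below uses this technique on top of a bounded brute-force satisfiability check; both the choice of algorithm and the helper names are assumptions for illustration, since SMT solvers typically provide their own unsat-core extraction.

```python
from itertools import product

def sat(conjuncts, variables, domain=range(-20, 21)):
    """Bounded stand-in for an SMT query over a small integer box."""
    return any(all(c(dict(zip(variables, vals))) for c in conjuncts)
               for vals in product(domain, repeat=len(variables)))

def unsat_core(named, variables):
    """Deletion-based core of an unsatisfiable conjunction.

    named: list of (label, predicate) pairs whose conjunction is unsat.
    Returns the labels of a subset that is still unsatisfiable.
    """
    core = list(named)
    for item in list(named):
        trial = [c for c in core if c is not item]
        if not sat([f for _, f in trial], variables):
            core = trial  # item is not needed for the conflict
    return [label for label, _ in core]

# Hypothetical state output: the conflict only involves the x predicates.
state = [("x>5", lambda e: e["x"] > 5),
         ("x<0", lambda e: e["x"] < 0),
         ("y>3", lambda e: e["y"] > 3)]
core = unsat_core(state, ["x", "y"])
print(core)
```

The resulting assumption mentions only the predicates in the core, so it also blocks states that differ in the value of the unrelated predicate on `y`.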


V. SYNTHESIS ALGORITHM 
A. Synthesis 


Data: TSL(T) specification φ
Result: Satisfying Mealy machine, UNREALIZABLE, or non-termination

while true do
    φB := prop_encode(φ);
    (r, m) := synth(φB);
    if r is UNREALIZABLE then
        c, ψ := isconsistent(m);
        if c is ⊥ then φ := ψ → φ;
        else return UNREALIZABLE;
    else
        return concretize(m);

Algorithm 2: Synthesis using abstraction refinement.


Fig. 4. Synthesis steps for the running example.


Our synthesis procedure is shown in Algorithm 2. The procedure
starts with a specification in TSL(T) that is translated
to an LTL specification as described in Section IV-A. The
LTL specification is given to a synthesis tool for propositional
LTL. If the synthesis tool finds a realizing system, this system
encodes a solution for the TSL(T) synthesis problem. If not,
the LTL synthesizer gives us a counterstrategy, which we
analyze to find any inconsistencies with the theory. If the
counterstrategy is theory-consistent, it gives us a counterexample
to the realizability of the TSL(T) formula. If the theory
solver shows that the counterexample is theory-inconsistent,
we refine the specification by adding assumptions that exclude the
inconsistency observed. We illustrate the approach using an
example and then prove its correctness.
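
The control flow of the refinement loop can be sketched as a higher-order function. The callbacks `prop_encode`, `synth`, `is_consistent`, and `concretize` stand for the components named in Algorithm 2; their signatures, the iteration bound, and the toy stubs below are assumptions made so that the sketch runs on its own.

```python
def synthesize(spec, prop_encode, synth, is_consistent, concretize,
               max_iterations=100):
    """Abstraction-refinement loop in the style of Algorithm 2.

    The loop in the paper may not terminate; this sketch adds an
    iteration bound and returns "UNKNOWN" when it is exhausted.
    """
    assumptions = []
    for _ in range(max_iterations):
        realizable, machine = synth(prop_encode(spec, assumptions))
        if realizable:
            return "REALIZABLE", concretize(machine)
        consistent, new_assumptions = is_consistent(machine)
        if consistent:
            return "UNREALIZABLE", machine
        assumptions.extend(new_assumptions)  # refine: assumptions -> spec
    return "UNKNOWN", None

# Toy stubs: synthesis succeeds once two assumptions have been learned.
toy_encode = lambda spec, assumptions: (spec, tuple(assumptions))
def toy_synth(encoded):
    _, assumptions = encoded
    if len(assumptions) >= 2:
        return True, "mealy"
    return False, f"counterstrategy-{len(assumptions)}"
toy_check = lambda m: (False, [f"blocks {m}"])

result = synthesize("phi", toy_encode, toy_synth, toy_check, str.upper)
print(result)
```

Keeping the assumptions in a separate list mirrors the refinement step of Algorithm 2, where the new specification is the implication from the collected assumptions to the original formula.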


Example 4. We apply Algorithm 2 to synthesize a system
from the specification $\varphi$ given in Example 1. To better illustrate
the algorithm, we will only introduce one new assumption per
iteration. The machines created during execution (5 counterstrategies
in the form of Moore machines and one strategy in
the form of a Mealy machine) are depicted in Figure 4. We
refer to the specification in step $n$ as $\varphi_n = \bigl(\bigwedge_{k=1}^{n} \psi_k\bigr) \rightarrow \varphi$,
where $\psi_k$ are the assumptions added in step $k$.

The LTL encoding of $\varphi_0 = \varphi$ is $\varphi_B$ (see Example 2).
Propositional synthesis results in the counterstrategy shown
in Figure 4.1. Algorithm 1 reveals that the second state of
this counterexample (shown in red) is inconsistent, because
$\neg(0 \le x) \wedge \neg(x < 100)$ is unsatisfiable. We obtain the new
assumption $\psi_1 = \Box(0 \le x \vee x < 100)$. Note that this is a
more general assumption than just the negated state formula.

Attempting to synthesize a system for $\varphi_1$ (we will omit the
LTL encoding step from now on) results in the counterstrategy
shown in Figure 4.2. This time, all outputs are consistent,
but the transition $[x \leftarrow x - i]$ is inconsistent: if $x < 100$ and
$0 \le i$ and we apply the update $[x \leftarrow x - i]$, it cannot be
that $\neg(x < 100)$ holds in the following state. We obtain the assumption
$\psi_2 = \Box(x < 100 \wedge 0 \le i \wedge [x \leftarrow x - i] \rightarrow \mathsf{X}\, x < 100)$.

Boolean synthesis for $\varphi_2$ results in a similar counterstrategy
(Figure 4.3); this time the transition $[x \leftarrow x + i]$ is inconsistent
and $\psi_3 = \Box(0 \le x \wedge 0 \le i \wedge [x \leftarrow x + i] \rightarrow \mathsf{X}\, 0 \le x)$.

Synthesis for $\varphi_3$ leads to the counterstrategy shown in
Figure 4.4. The first two consistency checks pass, but case 3
reports that the transition $[x \leftarrow x - i]$ is only valid for some of
the possible values in the first state. We learn the new predicate
$0 \le x - i$ and the assumption $\psi_4 = \Box(0 \le x - i \wedge [x \leftarrow x - i] \rightarrow
\mathsf{X}\, 0 \le x) \wedge \Box \neg(0 \le i \wedge 0 \le x - i \wedge \neg(0 \le x))$. This includes
the assumption obtained from the transition as well as one to
prevent state inconsistencies with the new predicate.¹

Running the Boolean synthesis algorithm again results in the
counterstrategy in Figure 4.5. The transition $[x \leftarrow x + i]$ is inconsistent
and we add $\psi_5 = \Box(\neg(0 \le x - i) \wedge i < 5 \wedge [x \leftarrow x + i] \rightarrow \mathsf{X}\, x <
100)$.

Boolean synthesis is executed for the last time on $\varphi_5$. This
time, the synthesis tool produces a Boolean system satisfying
the specification; it corresponds to the system shown in
Figure 1.


Theorem 1. Any Boolean system $M_B$ returned by Algorithm 2
can be converted to a theory system $M_{\mathcal{T}}$ that satisfies $\varphi$.


Proof. To obtain $M_{\mathcal{T}}$, an initial valuation $\mathbf{r}_0$ that satisfies $\varphi$ at
point 0 has to be chosen. All other components are the same
as in $M_B$. Let $\varphi' = \psi \rightarrow \varphi$ be the last specification used
in the algorithm and $\psi$ all the added assumptions. We know that
$M_B \models \varphi'$ and that $M_B$ and $M_{\mathcal{T}}$ share the same extended
traces. Therefore, $M_{\mathcal{T}}$ satisfies $\varphi'$. From Lemma 2 it follows
that $M_{\mathcal{T}} \models \psi$. Thus $M_{\mathcal{T}}$ satisfies $\varphi$.


Even though our algorithm is not guaranteed to terminate,
it can prove unrealizability in certain cases. For $x = 0 \rightarrow
\Box([x \leftarrow x + 1] \wedge x < 3)$ we can perform two refinement steps
and learn the new predicates x > 2 and x > 1. Using these, the
propositional synthesis tool can build a consistent environment
strategy. There are no conflicts that could be used to further
refine the specification. This shows that the specification is
unrealizable.


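Theorem 2. If Algorithm 2 returns unrealizable, there is no
$M_{\mathcal{T}} \models \varphi$ and $\varphi$ is unrealizable by machines using the updates
$U$.


Proof. If there exists a Theory Moore Machine $M'_{\mathcal{T}} \models \neg\varphi$, there is no
machine $M_{\mathcal{T}} \models \varphi$. The propositional synthesis tool provides
us with a machine $M_B \models \neg\varphi'$ where $\varphi' = \psi \rightarrow \varphi$ for
assumptions $\psi$. The consistency check results in consistent,
so by Lemma 1 there exists an $M'_{\mathcal{T}} \models \neg\varphi'$. According to
Lemma 2, $\psi$ is satisfied by $M'_{\mathcal{T}}$. Therefore, $M'_{\mathcal{T}} \models \neg\varphi$ and no
$M_{\mathcal{T}} \models \varphi$.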


¹ We include this here instead of in its own step to simplify the example.


B. Limitations 


Our algorithm is not guaranteed to terminate. In Section VI 
we discuss multiple specifications that can be successfully 
synthesized or where unrealizability can be shown. In this 
section, we show two exemplar cases for which our algorithm 
will not terminate. 

Our algorithm cannot handle reachability properties where
the number of required steps depends on the concrete value of
a state variable and is unbounded. The specification $0 < x \rightarrow
(\Diamond(x < 0) \wedge \Box([x \leftarrow x + 1] \vee [x \leftarrow x - 1]))$ with the state
variable $x$ is an example of this. The specification is obviously
realized by a system always using the update $[x \leftarrow x - 1]$.
However, we would add the new predicates x > 1, x > 2, ...
without terminating.

A similar problem can occur for unrealizability. For $x =
1 \rightarrow (\Diamond(x = 0) \wedge \Box[x \leftarrow x + 1])$ we learn the predicates x =
−1, x = −2, ... without terminating. However, the predicate
x > 1 would allow us to prove unrealizability.


C. Decidable fragment 


Theorem 3. The TSL synthesis problem for the theory of 
equality is decidable. 


Algorithm 2 will always terminate if the set of predicates
that Algorithm 1 can generate is finite. For a finite set of
predicates, the set of assumptions that can be added by cases one
and two is also finite. Since the assumptions block the
counterstrategy from reappearing, there cannot be
infinitely many counterstrategies and the synthesis algorithm
will terminate with the correct answer. The theory of equality
only allows updates to move values. Iterating the weakest
precondition can only create finitely many predicates, and for the
equality theory quantifier elimination cannot introduce
new constants either. Thus the number of possible predicates is finite
and the problem is decidable.


VI. EXPERIMENTAL EVALUATION 


We implemented our algorithm in our tool Raboniel². Our
implementation relies on several external tools: tsltools [6]
is used for parsing TSL and for performing the propositional
encoding, Strix [8] is used for LTL synthesis, and Z3 [11]
is used as the SMT solver. When performing counterexample
analysis using Algorithm 1, we add all assumptions from the
same case before we start the next iteration. The obtained
theory Mealy machine can be compiled into a Python program.


A. Extended running example 


The first experiment is an extension of Example 4. We
change two parameters in the specification. The system is no
longer allowed to change between the two updates at every
step. Instead, after changing the update, it has to use the new
update for the next $c$ steps. This shows how our algorithm
deals with more complex temporal properties. We also varied
the size of the intervals for $x$ and $i$, demonstrating that our
algorithm is independent of the size of the concrete state
space. The results are listed in Table I, including the
parameters used ($c$, $x_{max}$, $i_{max}$), the number of refinements, the
number of states in the minimized system, the number of
predicates learned during the whole execution, and the total
run time in seconds. The table includes realizable as well as
unrealizable configurations. The different ranges for $x$ and $i$
show how our approach can handle state spaces symbolically:
it behaves the same whether there are 100 or 100000 concrete
states. This number of concrete states is also far above what
can be solved with explicit states in LTL. The larger values
of $c$ require the system to plan further ahead by limiting how
often it can switch its output; this requires a second state and
a few additional predicates.


TABLE I
RESULTS FOR THE EXTENDED RUNNING EXAMPLE.

c   x_max     i_max   # refine.   # states   # new pred.   time [s]
1   100       5       4           1          2             1.0
2   100       5       5           2          2             1.3
2   100 000   50      5           2          2             1.3
3   100       5       9           2          4             2.9
3   100 000   50      10          2          4             3.1
1   100       110     6           unreal.    3             1.0
2   100       60      4           unreal.    2             0.8


² https://doi.org/10.5281/zenodo.5647461


B. Elevator 


A classic example for reactive synthesis is a controller 
for an elevator. The single state variable floor represents 
the current position of the elevator. It can start anywhere 
between the first floor and the maximum floor and is not 
allowed to leave this interval. The controller has three options: 
move the elevator up or down or stay in the same position. 
Every floor has to be visited infinitely often. The results are
shown in Table II as type simple. We varied the number of
floors of the building to show how our algorithm scales with
more complex specifications. No new predicates are learned
because a sufficient number of predicates is already included in
the specification (equality tests for every floor are part of
the liveness properties). The required time appears to grow
exponentially with the number of floors, because more floors
lead to larger propositional synthesis problems (which are
double-exponential in the worst case). The number of states
stays constant because most of the complexity is captured by
the predicates, e.g., the position of the elevator. The overall
time is still reasonable even for a large number of floors.
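To make the elevator specification concrete, the following minimal sketch hand-writes one controller that satisfies it; it is not the controller produced by the synthesis tool. The names `sweep_controller` and `visits_all_floors` are hypothetical, and the check only inspects a finite prefix of the infinite behavior.

```python
from typing import Iterator


def sweep_controller(start: int, max_floor: int) -> Iterator[int]:
    """Yield the floor sequence of a controller that sweeps up and down forever.

    Hand-written illustration of the 'simple' elevator specification: the
    position stays within [1, max_floor] and every floor recurs forever.
    """
    if not 1 <= start <= max_floor:
        raise ValueError("start floor outside [1, max_floor]")
    floor, direction = start, 1
    while True:
        yield floor
        if max_floor == 1:
            continue  # the only legal action is to stay
        if floor == max_floor:
            direction = -1  # turn around at the top
        elif floor == 1:
            direction = 1  # turn around at the bottom
        floor += direction


def visits_all_floors(start: int, max_floor: int, horizon: int) -> bool:
    """Check on a finite prefix that the safety bound holds and all floors occur."""
    seen = set()
    positions = sweep_controller(start, max_floor)
    for _ in range(horizon):
        floor = next(positions)
        assert 1 <= floor <= max_floor  # safety part of the specification
        seen.add(floor)
    return seen == set(range(1, max_floor + 1))
```

A prefix of length 2 * max_floor is enough for this particular controller to touch every floor, which is why the liveness part can be sanity-checked on a short horizon.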

A different version of this specification is shown in Table II 
as type signal. In this version, the environment controls a 
variable signal to select the floor the elevator has to reach 
which is stored in the state variable target. This results in 
a more complex specification with worse run time, which is 
dominated by propositional synthesis. 


C. Cyber-Physical Systems 


The previous examples all used linear integer arithmetic. 
We can also use other SMT theories like linear real arithmetic 
(LRA). Using reals allows us to model linear cyber-physical 
systems. This example is inspired by Belta et al. [12] chapter 


TABLE II
RESULTS FOR THE ELEVATOR.

type   | # floors | # refinements | # states | time [s]
simple | 3        | 13            | 2        | 3.1
simple | 4        | 11            | 3        | 3.7
simple | 5        | 15            | 4        | 8.2
simple | 8        | 21            | 4        | 45
simple | 10       | 24            | 4        | 185
signal | 3        | 5             | 1        | 35
signal | 4        | 5             | 1        | 217
signal | 5        | 6             | 1        | 1424


9, a system of two coupled water tanks with linear dynamics;
one water tank (x2) drains and the other one (x1) is refilled
by the controller. An illustration is depicted in Figure 5.
We discretize the inputs (refill tank x1) with two values (0
and 0.0003), represented as different updates. We created two
variants of the system. The first one is a safety specification
where the water level of both tanks has to be kept between
0.1 and 0.7.

Synthesis of this system takes 31 seconds and 4 refinements.
The resulting system has only a single state, but 13 new
predicates were required.

The second version consists of only one water tank, but
requires the liveness property: whenever the water level falls
below 0.1, it has to eventually exceed 0.4. A system realizing
that specification can be synthesized using 18 refinements in
95 seconds (9 new predicates); it consists of 2 states. These
examples demonstrate that our tool can handle updates with
more complex operations. This leads to a large number of
new predicates, but the systems can still be synthesized in
less than two minutes.


Fig. 5. Water tanks system


D. Comparison with Related Work 


The TSL paper by Finkbeiner et al. [6] contains various
examples of TSL specifications. However, most of them do
not require any refinement, and the first LTL approximation
is already realizable. For these examples, our tool would
perform the same, because no theory refinement is used. A
small number of examples required refinement. We converted
two of them by replacing the uninterpreted functions for
increment and decrement with native integer operations. The
implementation of their refinement approach is not publicly
available; we thus compare our results to the numbers reported
in their paper. The experiment “TwoCountersInRange” took
our tool 8 refinements and 1677s compared to their 173s.
The experiment “OneCounterGUI” took our tool only 4
refinements and 17.2s, which is roughly a factor of 100 faster
than their 1767s. These results suggest that either tool can
outperform the other by an order of magnitude, depending on
the example.

We also compared our tool using a benchmark set of safety 
games on infinite grid worlds that was introduced by Neider 
and Topcu [13]. We compare with the following tools: the 
TABLE III
INFINITE GRID WORLD BENCHMARKS FROM [13]. ALL TIMES IN SECONDS;
"-" DENOTES A TIME-OUT AFTER 900s. TOOL ABBREVIATIONS: C
(CONSYNTH), J (JSYN-VG), D (DT-SYNTH), S (SAT-SYNTH), R
(RPNI-SYNTH), G (GENSYS)

Benchmark    | C   | J   | D   | S    | R   | G   | Raboniel
Box          | 3.7 | 0.6 | 0.3 | 0.3  | 0.1 | 0.3 | 1.9
Box Limited  | 0.4 | 1.7 | 0.1 | 0.4  | 0.5 | 0.2 | 0.6
Diagonal     | 1.9 | 40  | 24  | 13   | 0.5 | 0.2 | 10.9
Evasion      | 1.5 | 0.5 | 0.2 | 81   | 0.1 | 0.7 | 5.4
Follow       | -   | 12  | 0.3 | 88.9 | -   | 0.7 | -
Solitary Box | 0.4 | 0.9 | 0.1 | 0.3  | 0.1 | 0.3 | 0.5
Square 5x5   | -   | 65  | 2.5 | 0.6  | 0.2 | 0.3 | 75.1


logic-based synthesis tools ConSynth [14], JSyn-VG [15], and
GenSys [16]; the automata-learning-based tools SAT-Synth
and RPNI-Synth [13]; as well as the decision-tree-learning
tool DT-Synth [17]. The results of these experiments are shown
in Table III; our tool is listed as Raboniel, and the results for
the other tools are reproduced from [16]. Our tool is able to
solve 6 out of the 7 benchmarks within 15 minutes. In terms of
execution time, our tool is at the slower end of the spectrum.
However, TSL(T) allows us to express and handle more
sophisticated specifications. Most other tools (except ConSynth
and JSyn-VG) only support safety properties and would not be
able to handle the other examples shown in this paper.


VII. RELATED WORK 


The first paper on TSL [6] introduces this logic as a way to
do synthesis while separating control flow and data processing.
Reactive synthesis is used to build a control flow model which
describes how the uninterpreted functions are combined and
which of them is used when, based on a logic circuit. This
model can then be instantiated and translated to a functional
reactive program (FRP) [18]. Our approach has less separation
of data and control: by supporting theories, we can reason about
many operations and construct systems that would not be
possible using uninterpreted functions. We also directly create
executable code without the intermediary FRP. Another major
difference is the analysis of counter strategies. Finkbeiner et
al. use an algorithm specific to uninterpreted functions that
checks all possible traces up to a certain length. We have
shown that consistency checking can also be done by local
checks in a theory-independent way. That way, we can also
learn new predicates, which allow for the synthesis of otherwise
impossible specifications, and we can prove unrealizability.

The extension to TSL(T) was first done by Finkbeiner
et al. [7]. They study uninterpreted functions and Presburger
arithmetic and provide a search-based algorithm to check
satisfiability. However, they did not look into synthesis.

Another recent extension of TSL is by Choi et al. [19].
They describe a different approach to adding arithmetic to TSL
using syntax-guided synthesis (SyGuS) [20]. The TSL formula
is translated into sequential SyGuS problems and the solutions
are used to create assumptions. This technique cannot create
new predicates and thus will not be able to solve problems
such as our running example. Their solution was developed
independently and in parallel to our approach.

Other techniques for reactive synthesis go beyond Booleans.
Reactive synthesis from register automata specifications
has been studied [2], [4], [3], [21]. These models allow
comparison (equality/inequalities) of data values, but no
operations. Multiple decidable fragments have been identified.
Another approach uses variable automata specifications [5],
[22]; these can perform arithmetic, and the authors also
identified a decidable fragment. While strong theoretical results
have been achieved for both register and variable automata, we
are not aware of any empirical evaluations. There are also
synthesis tools that specifically target cyber-physical systems
[23], [24]; these often rely on finite or receding horizons
instead of infinite traces. Counterexample-guided methods have
also been used for program synthesis [25] and model synthesis
[26].


VIII. CONCLUSION AND FUTURE WORK 


The algorithm presented in this paper performs specification
refinement in a purely lazy way. That is, new assumptions are
only added when they are encountered in a counterexample.
Performing some analysis upfront and after learning new
predicates has the potential to significantly improve the run
time. Testing for incompatibilities between predicates would
be an obvious target for this. Another extension would be
new strategies for learning predicates and heuristics to prevent
learning unnecessary predicates (which slow down propositional
synthesis).

We presented a synthesis procedure for temporal stream
logic modulo theories. Our algorithm is based on a CEGAR
[10] loop and translation to propositional LTL synthesis. The
synthesis problem for TSL modulo theories is undecidable in
general. However, we can synthesize systems or prove
unrealizability in many cases. Huge state spaces can be
handled by using a symbolic representation during synthesis.
Some specifications require new predicates; in many cases, we
are able to find these automatically.
Abstract—The identification of a deterministic finite automaton
(DFA) from labeled examples is a well-studied problem in the 
literature; however, prior work focuses on the identification of 
monolithic DFAs. Although monolithic DFAs provide accurate 
descriptions of systems’ behavior, they lack simplicity and inter- 
pretability; moreover, they fail to capture sub-tasks realized by 
the system and introduce inductive biases away from the inherent 
decomposition of the overall task. In this paper, we present 
an algorithm for learning conjunctions of DFAs from labeled 
examples. Our approach extends an existing SAT-based method to 
systematically enumerate Pareto-optimal candidate solutions. We 
highlight the utility of our approach by integrating it with a state- 
of-the-art algorithm for learning DFAs from demonstrations. Our 
experiments show that the algorithm learns sub-tasks realized by 
the labeled examples, and it is scalable in the domains of interest. 


I. INTRODUCTION 


Grammatical inference is a mature and well-studied field 
with many application domains ranging from machine learning 
to computational biology [1]. The identification of a mini- 
mum size deterministic finite automaton (DFA) from labeled 
examples is one of the most well-investigated problems in this 
field. Furthermore, with the increase in computational power in 
recent years, the problem can be solved efficiently by various 
tools available in the literature (e.g., [2], [3]). 

Existing work on DFA identification primarily focuses on 
the monolithic case, i.e., learning a single DFA from examples. 
Although such DFAs capture a language consistent with the 
examples, they may lack simplicity and interpretability. Fur- 
thermore, complex tasks often decompose into independent 
sub-tasks. However, monolithic DFA identification fails to 
capture the natural decomposition of the system behavior, 
introducing an inductive bias away from the inherent de- 
composition of the overall task. In this paper, we present an 
algorithm for learning DFA decompositions from examples by 
reducing the problem to graph coloring in SAT and a Pareto- 
optimal solution search over candidate solutions. A DFA 
decomposition is a set of DFAs such that the intersection of 
their languages is the language of the system, which implicitly 
defines a conjunction of simpler specifications realized by the 
overall system.¹ We present an application of our algorithm to
a state-of-the-art method for learning task specifications from
unlabeled demonstrations [4] to showcase a domain of interest
for DFA decompositions.

Related Work. Existing work considers the problem of 
minimal DFA identification from labeled examples [1]. It is 
shown that the DFA identification problem with a given upper 
bound on the number of states is an NP-complete problem [5]. 
Another work shows that this problem cannot be efficiently 
approximated [6]. Fortunately, practical methods exist in the 
literature. A common approach is to apply the evidence-driven
state-merging algorithm [7], [8], [9], which is a greedy
algorithm that aims to find a good local optimum. Other works 
for learning DFAs use evolutionary computation [10], [11], 
later improved by multi-start random hill climbing [12]. 

A different approach to monolithic DFA identification is
to leverage highly optimized modern SAT solvers by encoding
the problem in SAT [13]. In follow-up work, several symmetry
breaking predicates were proposed for the SAT encoding to
reduce the search space [3], [14], [15], [16]. However, to the
best of our knowledge, no work considers directly learning
DFA decompositions from examples and demonstrations.

This work also relates to the problem of decomposing a
known automaton. Ashar et al. [17] explore computing cascade
and general decompositions of finite state machines. The
Krohn-Rhodes theorem [18] reduces a finite automaton into a
cascade of irreducible automata. Kupferman & Mosheiff [19]
present various complexity results for DFA decomposability.

Finally, the problem of learning objectives from demonstra- 
tions of an expert dates back to the problem of Inverse Optimal 
Control [20] and, more recently in the artificial intelligence 
community, the problem of Inverse Reinforcement Learning 
(IRL) [21]. The goal in IRL is to recover the unknown reward 
function that an expert agent is trying to maximize based 
on observations of that expert. Recently, several works have 
considered a version of the IRL problem in which the expert 
agent is trying to maximize the satisfaction of a Boolean task 
specification [22], [23], [4]. However, no work considers learn- 
ing decompositions of specifications from demonstrations. 


II. PROBLEM FORMULATION 


Let $D$ denote the set of DFAs over some fixed alphabet
$\Sigma$. An $(m_1, \ldots, m_n)$-DFA decomposition is a tuple of $n$
DFAs $(A_1, \ldots, A_n) \in D^n$ where $A_i$ has $m_i$ states and
$m_1 \le m_2 \le \cdots \le m_n$. We associate a partial order $\prec$ on
DFA decompositions using the standard product order on the
number of states. That is, $(A'_1, \ldots, A'_n) \prec (A_1, \ldots, A_n)$ if
$m'_i \le m_i$ for all $i \in [n]$ and $m'_j < m_j$ for some $j \in [n]$. In
this case, we say $(A'_1, \ldots, A'_n)$ dominates $(A_1, \ldots, A_n)$. A
DFA decomposition $(A_1, \ldots, A_n)$ accepts a string $w$ iff all
$A_i$ accept $w$. A string that is not accepted is rejected. The
language of a decomposition, $L(A_1, \ldots, A_n)$, is the set of
accepted strings, i.e., the intersection of all DFA languages.

In order to bias towards “simpler” solutions, we further
extend the partial order $\prec$ over equally sized (i.e., if $m'_i = m_i$
for all $i \in [n]$) decompositions by letting $(A'_1, \ldots, A'_n) \prec
(A_1, \ldots, A_n)$ if $(A'_1, \ldots, A'_n)$ has fewer total non-stuttering
edges than $(A_1, \ldots, A_n)$.


¹Our algorithm and SAT encoding can easily be generalized to unions or
even arbitrary Boolean combinations of DFAs.
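The ordering on decompositions can be sketched directly on size tuples. The snippet below is an illustration only, assuming a decomposition is summarized by its tuple of state counts and its total number of non-stuttering edges; the function names `dominates` and `precedes` are hypothetical.

```python
from typing import Sequence


def dominates(sizes_a: Sequence[int], sizes_b: Sequence[int]) -> bool:
    """Product order on state counts: a <= b everywhere and a < b somewhere."""
    if len(sizes_a) != len(sizes_b):
        raise ValueError("decompositions must contain the same number of DFAs")
    pairs = list(zip(sizes_a, sizes_b))
    return all(a <= b for a, b in pairs) and any(a < b for a, b in pairs)


def precedes(sizes_a: Sequence[int], edges_a: int,
             sizes_b: Sequence[int], edges_b: int) -> bool:
    """Extended order: equally sized decompositions are compared by edge count."""
    if tuple(sizes_a) == tuple(sizes_b):
        return edges_a < edges_b  # tie-break on non-stuttering edges
    return dominates(sizes_a, sizes_b)
```

Under this sketch, (1, 2) dominates (2, 2), while (1, 3) and (2, 2) are incomparable, which is why a frontier of several solutions can exist.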

We study the problem of finding a DFA decomposition from 
a set of positive and negative labeled examples such that the 
decomposition accepts the positive examples and rejects the 
negative examples. We start by formally defining the DFA 
decomposition identification problem (DFA-DIP), and then 
presenting an overview of the proposed approach. 


The Deterministic Finite Automaton Decomposition Identification
Problem (DFA-DIP). Given positive examples $D_+$,
negative examples $D_-$, and a natural number $n \in \mathbb{N}$, find
an $(m_1, \ldots, m_n)$-DFA decomposition $(A_1, \ldots, A_n)$ satisfying
the following conditions.

(C1) The decomposition is consistent with $(D_+, D_-)$:

$$D_+ \subseteq L(A_1, A_2, \ldots, A_n), \qquad D_- \subseteq \Sigma^* \setminus L(A_1, A_2, \ldots, A_n).$$

(C2) There does not exist a DFA decomposition that dominates
$(A_1, \ldots, A_n)$ and satisfies (C1).
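Condition (C1) can be checked mechanically for a candidate decomposition. The following is a minimal sketch, not the authors' implementation: the `DFA` class, its field names, and the convention that a missing transition is a stuttering self-loop are all assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Sequence, Set, Tuple


@dataclass
class DFA:
    start: int
    accepting: Set[int]
    # (state, symbol) -> next state; absent entries are treated as
    # stuttering self-loops (an assumption of this sketch).
    delta: Dict[Tuple[int, str], int] = field(default_factory=dict)

    def accepts(self, word: Iterable[str]) -> bool:
        state = self.start
        for symbol in word:
            state = self.delta.get((state, symbol), state)
        return state in self.accepting


def decomposition_accepts(dfas: Sequence[DFA], word: Iterable[str]) -> bool:
    """A decomposition accepts a word iff every DFA accepts it (conjunction)."""
    return all(dfa.accepts(word) for dfa in dfas)


def is_consistent(dfas: Sequence[DFA],
                  positives: Iterable[str],
                  negatives: Iterable[str]) -> bool:
    """Condition (C1): all positives accepted, every negative rejected."""
    return (all(decomposition_accepts(dfas, w) for w in positives)
            and not any(decomposition_accepts(dfas, w) for w in negatives))
```

For instance, one DFA that accepts words containing `a` and another that accepts words containing `b` together accept exactly the words containing both symbols; each negative example only needs to be rejected by one of the two.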


We refer to the set of DFA decompositions that solve an
instance of DFA-DIP as the Pareto-optimal frontier of solutions.
Note that for $n = 1$, DFA-DIP reduces to monolithic DFA
identification. We propose finding the set of DFA decompositions
that solve DFA-DIP by reduction to graph coloring in
SAT and a breadth-first search in solution space. Specifically,
we extend the existing work on SAT-based monolithic DFA
identification [13], [15] to finding $n$ DFAs with $m_1, \ldots, m_n$
states and $q$ non-stuttering edges such that the intersection of
their languages is consistent with the given examples. On top
of this SAT-based approach, we develop a search strategy over
the numbers of states and edges passed to the SAT solver, as
these values are not known a priori.


III. LEARNING DFAS FROM EXAMPLES²


In this section, we present the proposed approach. We start 
with the SAT encoding of the DFA decomposition problem and 
continue with the Pareto frontier search in the solution space. 
We then showcase an example of learning conjunctions of 
DFAs from labeled examples. Finally, we present experimental 
results and evaluate the scalability of our method. 


²Our MIT-licensed code is freely available at [24].


A. Encoding DFA-DIP in SAT 


We extend the SAT encoding for monolithic DFA identification
presented in [13], [15], which solves a graph coloring
problem, to finding $n$ DFAs with $m_1, m_2, \ldots, m_n$ states. The
extension relies on the observation that for conjunctions of
DFAs, a positive example must be accepted by all DFAs, and
a negative example must be rejected by at least one of the
DFAs. Due to space limitations, we only present the modified
clauses of the encoding, and refer the reader to Appendix A of
the extended version of the paper [25] for further details.

The encoding works on an augmented prefix tree acceptor
(APTA), a tree-shaped automaton constructed from the given
examples, with nodes corresponding to prefixes and edges to
appending letters. The APTA has a path for each example
leading to an accepting or rejecting state based on the
example’s label; an APTA therefore defines $D_+$ and $D_-$, which
then constrain the accepting states, rejecting states, and the
transition function of the unknown DFAs. For each DFA $A_i$,
the encoding associates each APTA state with one of the $m_i$
colors for DFA $A_i$, subject to the constraints imposed by $D_+$
and $D_-$. APTA states with the same (DFA-indexed) color will
be the same state in the corresponding DFA. We refer to the
states of an APTA as $V$, its accepting states as $V_+$, and its
rejecting states as $V_-$. Given $n$ for the number of DFAs,
$m_1, \ldots, m_n$ for the number of states of the DFAs, and $q$ for
the number of non-stuttering edges, the SAT encoding uses
three types of variables:


1) color variables $x^k_{v,i} = 1$ ($k \in [n]$; $v \in V$; $i \in [m_k]$) iff
APTA state $v$ has color $i$ in DFA $k$,
2) parent relation variables $y^k_{l,i,j} = 1$ ($k \in [n]$; $l \in \Sigma$, where
$\Sigma$ is the alphabet; $i, j \in [m_k]$) iff DFA $k$ transitions with
symbol $l$ from state $i$ to state $j$, and
3) accepting color variables $z^k_i = 1$ ($k \in [n]$; $i \in [m_k]$) iff
state $i$ of DFA $k$ is an accepting state.
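As an illustration of the structure the encoding operates on, the sketch below builds a prefix tree from labeled examples and records which nodes are accepting and rejecting. It is a simplified stand-in under stated assumptions, not the construction used in [13], [15]: `build_apta`, the integer node ids, and the returned tuple layout are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_apta(positives: Iterable[str], negatives: Iterable[str]
               ) -> Tuple[Dict[Tuple[int, str], int], Set[int], Set[int], int]:
    """Build a prefix tree over the examples.

    Returns (children, accepting, rejecting, num_nodes), where `children`
    maps (node, symbol) to the child node and node 0 is the empty prefix.
    """
    children: Dict[Tuple[int, str], int] = {}
    accepting: Set[int] = set()
    rejecting: Set[int] = set()
    num_nodes = 1  # the root already exists

    def insert(word: str) -> int:
        nonlocal num_nodes
        node = 0
        for symbol in word:
            key = (node, symbol)
            if key not in children:
                children[key] = num_nodes  # allocate a fresh node
                num_nodes += 1
            node = children[key]
        return node

    for word in positives:
        accepting.add(insert(word))
    for word in negatives:
        rejecting.add(insert(word))
    if accepting & rejecting:
        raise ValueError("an example is labeled both positive and negative")
    return children, accepting, rejecting, num_nodes
```

In the notation of the text, the returned node ids play the role of $V$, and the two sets play the roles of $V_+$ and $V_-$; nodes in neither set are unconstrained prefixes.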
The encoding for the monolithic DFA identification also uses 
the same variable types; however, in our encoding, we also 
index variables over n DFAs instead of a single DFA. With this 
extension, one can trivially instantiate the encoding presented 
in [13], [15]. Below, we list the new rules we define for our 
problem. For the complete list of rules, see Appendix A of 
the extended version of the paper [25]. 


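(R1) A negative example must be rejected by at least one
DFA:

$$\bigwedge_{v \in V_-} \bigvee_{k \in [n]} \bigwedge_{i \in [m_k]} \left( x^k_{v,i} \Rightarrow \neg z^k_i \right)$$

(R2) Accepting and rejecting states of APTA cannot be
merged:

$$\bigwedge_{v_- \in V_-} \bigwedge_{v_+ \in V_+} \bigwedge_{k \in [n]} \bigwedge_{i \in [m_k]} \left( \left( x^k_{v_-,i} \wedge \neg z^k_i \right) \Rightarrow \neg x^k_{v_+,i} \right)$$

(R3) Upper bound on the number of non-stuttering edges:

$$\sum_{k \in [n]} \sum_{l \in \Sigma} \sum_{i, j \in [m_k],\, i \neq j} y^k_{l,i,j} \le q$$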
In the encoding of [13], [15], we replace the rule stating
that the resulting DFA must reject all negative examples with 
(R1), and (R2) is used instead of the original rule stating that 
accepting and rejecting states of APTA cannot be merged. 
Notice that since a rejecting state of APTA is not necessarily a 
rejecting state of a DFA k, we need to use the new rule (R2). 
Finally, (R3) enables controlling the maximum number of non- 
stuttering transitions. As we shall see, this will enable us to 
satisfy (C2). 
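One possible way to turn rules (R1) and (R2) into CNF clauses is sketched below, using DIMACS-style integer literals. This is an assumption-laden illustration rather than the paper's encoding: the auxiliary variables `r` (meaning "DFA k rejects APTA node v"), the `VarPool` helper, and the function names are all hypothetical, and (R3) is omitted because it needs a cardinality encoding.

```python
from itertools import count
from typing import Dict, Hashable, Iterable, List, Sequence, Set, Tuple


class VarPool:
    """Maps structured keys such as ("x", k, v, i) to positive integer ids."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[Hashable, ...], int] = {}
        self._fresh = count(1)

    def var(self, *key: Hashable) -> int:
        if key not in self._ids:
            self._ids[key] = next(self._fresh)
        return self._ids[key]


def r1_r2_clauses(pool: VarPool, sizes: Sequence[int],
                  v_pos: Set[int], v_neg: Set[int]) -> List[List[int]]:
    clauses: List[List[int]] = []
    n = len(sizes)
    # (R1): some DFA k rejects each negative node v. The auxiliary r(k, v)
    # implies x(k, v, i) -> not z(k, i) for every color i of DFA k.
    for v in sorted(v_neg):
        clauses.append([pool.var("r", k, v) for k in range(n)])
        for k in range(n):
            for i in range(sizes[k]):
                clauses.append([-pool.var("r", k, v),
                                -pool.var("x", k, v, i),
                                -pool.var("z", k, i)])
    # (R2): (x(k, v-, i) and not z(k, i)) -> not x(k, v+, i).
    for vn in sorted(v_neg):
        for vp in sorted(v_pos):
            for k in range(n):
                for i in range(sizes[k]):
                    clauses.append([-pool.var("x", k, vn, i),
                                    pool.var("z", k, i),
                                    -pool.var("x", k, vp, i)])
    return clauses


def satisfied(clauses: Iterable[Sequence[int]], true_vars: Set[int]) -> bool:
    """Evaluate a CNF under the assignment that sets exactly true_vars to True."""
    return all(any((lit > 0) == (abs(lit) in true_vars) for lit in clause)
               for clause in clauses)
```

Clause lists in this integer format are the usual input to SAT solvers; the remaining rules of the monolithic encoding would have to be added before the result describes a DFA decomposition.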


Theorem 1. Given labeled examples, $n$ for the number of
DFAs, $m_1, \ldots, m_n$ for the number of states of DFAs, and $q$
for the number of non-stuttering edges, a solution to the above
SAT encoding satisfies (C1) of DFA-DIP.


Proof: We assume that the SAT-based reduction to graph
coloring for monolithic DFA identification given in [13] is
correct. Next, observe that (R3) can only remove solutions and
thus does not affect (C1). Constraints (R1) and (R2) replace
similar constraints in the monolithic encoding given in [13]:

(R1’) a negative example must be rejected by the DFA:

$$\bigwedge_{v \in V_-} \bigwedge_{i \in [m_k]} \left( x_{v,i} \Rightarrow \neg z_i \right), \text{ and}$$

(R2’) accepting and rejecting states of the APTA cannot be
merged:

$$\bigwedge_{v_- \in V_-} \bigwedge_{v_+ \in V_+} \bigwedge_{i \in [m_k]} \left( x_{v_-,i} \Rightarrow \neg x_{v_+,i} \right).$$

In the monolithic DFA case, there is only a single DFA, so
for ease of notation we drop the index $k$. First notice that
constraints (R1’) and (R2’) have no bearing on whether the
DFA accepts each positive example. Therefore, our encoding
automatically requires that each DFA in the DFA decomposition
accepts all of the positive examples and is not constrained
to unnecessarily accept any unspecified examples.

Constraint (R1’) ensures that the resulting monolithic DFA
rejects every negative example by making the color of the node
in the APTA associated with the negative example rejecting.
Constraint (R1) replaces this and ensures that at least one of
the DFAs in the DFA decomposition rejects a negative example
by making the color of the node in the APTA associated with
the negative example rejecting in at least one of the $n$ DFAs
in the decomposition. Thus, the language intersection of the
resulting decomposition correctly rejects negative examples.

Constraint (R2’) ensures that no pair of rejecting and
accepting nodes of the APTA can be assigned the same
color (i.e., merged) in the resulting DFAs. Constraint (R2),
which replaces (R2’), ensures that for each DFA in the
decomposition, the pair $(v_-, v_+)$ of rejecting and accepting
nodes of the APTA is prevented from being assigned the same
color only if DFA $k$ is rejecting the negative example
associated with $v_-$ (which is handled by constraint (R1)). This
allows all but one DFA in the DFA decomposition to accept
negative examples. Therefore, the language of the decomposition
is not constrained to reject any unspecified examples.

□


Algorithm 1 Pareto frontier enumeration algorithm.

Require: Positive $D_+$ and negative $D_-$ labeled examples and
positive integer $n$.

$(P^*, Q) \leftarrow (\{\}, [(1, \ldots, 1)])$    ▷ Initial Pareto front and queue.
while $Q \neq \emptyset$ do
    $m \leftarrow Q$.dequeue()
    if $\nexists\, \hat{m} \in P^*$ s.t. $\hat{m} \prec m$ then
        SAT, $A \leftarrow$ SOLVE($n, m, D_+, D_-$)    ▷ Omits (R3).
        if SAT then
            $P^* \leftarrow P^* \cup \{A\}$    ▷ Add to the Pareto frontier.
        else
            for $k = 1, \ldots, n$ do
                $(m'_{-k}, m'_k) \leftarrow (m_{-k}, m_k + 1)$
                if ordered($m'$) then $Q$.enqueue($m'$)
return minimize_stutter($P^*$)    ▷ Binary search using (R3).


B. Pareto Frontier Search 


The SAT encoding detailed in Section III-A produces a DFA
decomposition that satisfies (C1), but not necessarily (C2).
In this section, we provide the details of the Pareto frontier
enumeration algorithm that uses the SAT encoding as an inner
loop to find a DFA decomposition that solves DFA-DIP.

Our proposed Pareto frontier enumeration algorithm is a
breadth-first search (BFS) over DFA decomposition size tuples
that skips tuples that are dominated by an existing solution.
This BFS is over a directed acyclic graph $G = (V, E)$ formed
in the following way. There is a vertex in the graph for
every ordered tuple of state sizes. There is an edge from
$(m_1, m_2, \ldots, m_n)$ to $(m'_1, m'_2, \ldots, m'_n)$ if there exists some
$j \in [n]$ such that:

$$m'_i = \begin{cases} m_i + 1 & \text{if } i = j; \\ m_i & \text{otherwise.} \end{cases}$$

A size tuple $(m_1, \ldots, m_n)$ is a sink, i.e., the search does
not continue past this vertex, if there exists an $(m_1, \ldots, m_n)$-
decomposition that solves DFA-DIP or the size tuple is
dominated by a previously traversed solution. In the former
case, the associated DFA decomposition is also returned as
a solution on the Pareto-optimal frontier. The BFS starts from
$m_1 = m_2 = \cdots = m_n = 1$, and performs the search
as explained. Algorithm 1 presents the details of the BFS
performed in the solution space for finding the Pareto frontier.
After finding a minimal number of states $m_1, m_2, \ldots, m_n$
that solve the problem, there still might exist multiple DFA
decompositions of that size that solve (C1). These ties are
broken in favor of DFA decompositions that have the fewest
total non-stuttering edges, $q$. For each minimal DFA
decomposition, this is done by a binary search over $q$ and
denoted minimize_stutter(·).
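The search described above can be sketched with the SAT call abstracted into an oracle. This is a minimal illustration under stated assumptions, not the released implementation: `solve` stands in for the SAT query on a size tuple, `max_states` is an added bound that guarantees termination of the sketch, and `minimize_edges` assumes that feasibility is monotone in the edge bound.

```python
from collections import deque
from typing import Callable, Dict, Optional, Tuple

Sizes = Tuple[int, ...]


def dominated_or_equal(candidate: Sizes, solved: Sizes) -> bool:
    """True if an already solved size tuple is component-wise <= candidate."""
    return all(s <= c for s, c in zip(solved, candidate))


def pareto_frontier(n: int, solve: Callable[[Sizes], Optional[object]],
                    max_states: int) -> Dict[Sizes, object]:
    """Breadth-first search over ordered size tuples, starting at (1, ..., 1)."""
    start: Sizes = (1,) * n
    queue = deque([start])
    enqueued = {start}
    frontier: Dict[Sizes, object] = {}
    while queue:
        m = queue.popleft()
        if any(dominated_or_equal(m, s) for s in frontier):
            continue  # sink: covered by an existing solution
        solution = solve(m)
        if solution is not None:
            frontier[m] = solution  # sink: new Pareto-optimal size tuple
            continue
        for k in range(n):
            succ = m[:k] + (m[k] + 1,) + m[k + 1:]
            ordered = all(succ[i] <= succ[i + 1] for i in range(n - 1))
            if ordered and succ[k] <= max_states and succ not in enqueued:
                enqueued.add(succ)
                queue.append(succ)
    return frontier


def minimize_edges(solve_with_bound: Callable[[int], Optional[object]],
                   upper: int) -> Optional[Tuple[int, object]]:
    """Binary search for the smallest edge bound q in [0, upper] that is feasible."""
    best = solve_with_bound(upper)
    if best is None:
        return None
    lo, hi = 0, upper
    while lo < hi:
        mid = (lo + hi) // 2
        candidate = solve_with_bound(mid)
        if candidate is not None:
            hi, best = mid, candidate
        else:
            lo = mid + 1
    return lo, best
```

Because every edge of the search graph increases the total number of states by one, the queue visits tuples in order of total size, so a tuple is only solved after all tuples that could dominate it have been examined.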


Theorem 2. Algorithm 1 is sound and complete; it outputs the 
full Pareto-optimal frontier of solutions without returning any 
dominated solutions, therefore satisfying (C2) of DFA-DIP. 


Proof: See the extended version of the paper [25]. □
(a) Experiment results answering (Q1), where we vary the number of DFAs.
(b) Experiment results answering (Q2), where we vary the number of examples.

Fig. 1. Experiment results evaluating the scalability of our algorithm w.r.t. (a) the number of DFAs implied by the examples and (b) the number of labeled examples.


C. Example: Learning Partially-Ordered Tasks 


We continue with a toy example showcasing the capabilities 
of the proposed approach. Later, we use the same class of 
decompositions to evaluate the scalability of our algorithm. 


Inspired by the multi-task reinforcement learning literature
[26], our example focuses on partially-ordered temporal tasks
executed in parallel. Specifically, consider a case where an
agent is performing two ordering tasks in parallel: (i) observe
[icon] before [icon], and (ii) observe [icon] before [icon]. A
positive example of such behavior is simply any sequence
of observations ensuring both of the given orderings, and a
negative example is any sequence that fails to satisfy both
orderings. We generate such positive and negative examples
and feed them to our algorithm. Figure 2 presents the learned
DFAs recognizing the ordering sub-tasks of the example. The
intersection of their languages is consistent with the given
observations, and their conjunction is the overall task realized
by the system generating the traces. The monolithic DFA
recognizing the same language has nine states, and is more
complicated (see Figure 4 in Appendix C of the extended
version of the paper [25]).


Fig. 2. Learned DFA decomposition. Panels (a) and (b) each show the learned DFA recognizing one of the two orderings.


D. Experimental Evaluation 


We evaluate the scalability of our algorithm through
experiments with changing sizes of the partially-ordered tasks
introduced in Section III-C. In our evaluation, we aim to
answer two questions: (Q1) “How does solving time scale
with the number of ordering tasks?”, and (Q2) “How does
solving time scale with the number of labeled examples?”.
We implement our algorithm in Python with PySAT [27], and
we use Glucose4 [28] as the SAT solver. Our baseline is an
implementation of the monolithic DFA identification encoding
from [13], [15] with the same software as our implementation.
Experiments are performed on a quad-core Intel i7 processor
clocked at 2.3 GHz with 32 GB of main memory.

To evaluate the scalability, we randomly generate positive
and negative examples with varying problem sizes. For (Q1),
we generate 10 partially-ordered task examples (half of which
are positive and half of which are negative) with (i)
2 symbols, and (ii) 4 symbols, and we vary the number of
DFAs from 2 to 12. For (Q2), we generate 10 to 20
partially-ordered task examples with (i) 2 symbols and 4 DFAs,
and (ii) 4 symbols and 2 DFAs. Half of these examples are
positive and the other half are negative. Since the examples
are generated randomly, we run the experiments for 10 different
random seeds and report the average. We set the timeout limit
to 10 minutes, and stop when our algorithm times out for all
random seeds.

Figure 1a presents the experiment results answering (Q1),
where we vary the number of DFAs implied by the given
examples. For partially-ordered tasks with 2 symbols, the green
solid line is the (monolithic DFA) baseline and the blue solid
line is our algorithm. Similarly, for partially-ordered tasks with
4 symbols, the pink dashed line is the baseline and the red
dashed line is our algorithm. Figure 1b presents the experiment
results answering (Q2), where we vary the number of examples.
For partially-ordered tasks with 2 symbols and 4 DFAs, the
green solid line is the baseline and the blue solid line is our
algorithm; for partially-ordered tasks with 4 symbols and 2
DFAs, the pink dashed line is the baseline and the red dashed
line is our algorithm. As expected, the baseline scales better
than our algorithm, as we also search for the Pareto frontier
and solve an inherently harder problem. Notice that given 10
examples, our algorithm is able to scale up to 11 DFAs for
tasks with 2 symbols, and 8 DFAs for tasks with 4 symbols;
for 2 symbols and 4 DFAs, it is able to scale up to 60
examples, and for 4 symbols and 2 DFAs, it is able to scale up
to 190 examples. As we demonstrate in the next section, these
limits for scalability are practically useful in certain domains.


IV. LEARNING DFAS FROM DEMONSTRATIONS 


Next, we show how our algorithm can be incorporated
into Demonstration Informed Specification Search (DISS),
a framework for learning languages from expert demonstrations
[4]. For our purposes, a demonstration is an unlabeled
path through a workspace that maps to a string and is biased
towards being accepted by some unknown language. For
example, we ran our implementation of DISS using
demonstrations produced by an expert attempting to accomplish
a task in a stochastic grid world environment, the same example
used in [4] and shown in Figure 3a. At each step, the agent can
move in any of the four cardinal directions, but because of wind
blowing from the north to the south, with some probability,
the agent will transition to the space south of it in spite of
its chosen action. Two demonstrations of the task “Reach [icon]
while avoiding [icon]. If it ever touches [icon], it must then
touch [icon] before reaching [icon]” are shown in Figure 3a.


[Figure 3 panels: (a) A stochastic grid world environment with expert demonstrations of an agent trying to accomplish a task. (b) Labeled examples conjectured by DISS (positive and negative). (c) Go to [icon]. (d) Avoid [icon]. (e) After [icon], go to [icon] before [icon]. (f) Monolithic DFA for the example presented in Section IV.]

Fig. 3. Figure 3a shows the stochastic grid world environment. Figure 3b shows the positive and negative examples of the expert’s behavior conjectured by
DISS, and Figures 3c to 3e showcase the associated DFA decomposition identified by our algorithm. Figure 3f shows the monolithic DFA learned in [4].


In order to efficiently search for tasks, DISS reduces the 
learning from demonstrations problem into a series of iden- 
tification problems to be solved by a black-box identification 
algorithm. The goal of DISS is to find a task that minimizes 
the joint description length, called the energy, of the task and 
the demonstrations assuming the agent were performing said 
task. The energy is measured as the number of bits needed to encode an object.


Below, we reproduce the results from [4], but using our
algorithm as the task identifier rather than the monolithic
DFA identifier provided³. The use of DFA decompositions
biases DISS to conjecture concepts that are simpler to express
in terms of a DFA decomposition. To define the description
length of DFA decompositions, we adapt the DFA encoding
used in [4] by expressing a decomposition as the concatenation
of the encodings of the individual DFAs. To remove
unnecessary redundancy, two optimizations were performed.
First, common headers, e.g., indicating the alphabet size, were
combined. Second, as the DFAs in a decomposition are
ordered by size, we expressed changes in size rather than
absolute size; see Appendix B in the extended version of the
paper [25] for details.


³ To allow exploring more decompositions, with some probability, the
number of DFAs in the decomposition was randomly incremented or decremented
during identification.


A. Experimental Evaluation 


In Figures 3c to 3e we present the learned DFA decomposition
along with the corresponding labeled examples (Figure 3b)
conjectured by DISS to explain the expert behavior. Importantly,
this decomposition exactly captures the demonstrated
task. We note that this is in contrast to the DFA learned 
in [4], shown in Figure 3f, which allows visiting after 
visiting (#]. Further, we remark that the time required to learn 
the monolithic and decomposed DFAs was comparable. In 
particular, the number of labeled examples was less than 60 
and as with the monolithic baseline, most of the time is not 
spent in task identification, but instead conjecturing the labeled 
examples. As we saw in Section I-D, this number of
examples is easily handled by our SAT-based identification 
algorithm. Finally, the number of labeled examples that needed 
to be conjectured to find low energy tasks was similar for 
both implementations (see Figures 5 and 6 in Appendix C
of the extended version of the paper [25]). Thus, our variant
of DISS performed similarly to the monolithic variant, while
finding DFAs that exactly represented the task.


V. CONCLUSION 


To the best of our knowledge, this work presents the first 
approach for solving DFA-DIP. Our algorithm works by 
reducing the problem to a Pareto-optimal search of the space 
of the number of states in a DFA decomposition with a SAT 
call in the inner loop. The SAT-based encoding is based on 
an efficient reduction to graph coloring. We demonstrated the
scalability of our algorithm on a class of problems inspired by
the multi-task reinforcement learning literature and showed that
the additional computational cost for identifying DFA decompositions
over monolithic DFAs is not prohibitive. Finally, we
showed how identifying DFA decompositions can provide a 
useful inductive bias while learning from demonstrations. 
Abstract—A system may be modelled as an operational model
(which has explicit notions of state and transitions between
states) or an axiomatic model (which is specified entirely as
a set of invariants). Most formal methods (e.g., IC3, invariant
synthesis, etc.) are designed for operational models and are largely
inaccessible to axiomatic models. Furthermore, no prior method
exists to automatically convert axiomatic models to operational
ones, so operational equivalents to axiomatic models had to be
manually created and proven equivalent.

In this paper, we advance the state of the art in axiomatic to
operational model conversion. We show that general axioms in
the μspec axiomatic modelling framework cannot be translated
to equivalent finite-state operational models. We also derive
restrictions on the space of μspec axioms that enable the
feasible generation of equivalent finite-state operational models
for them. As for practical results, we develop a methodology for
automatically translating μspec axioms to equivalent finite-state
automata-based operational models. We demonstrate the efficacy
of our method by using the models generated by our procedure
to prove the correctness of ordering properties on three register-
transfer-level (RTL) designs.


I. INTRODUCTION 


When modelling hardware or software systems using for- 
mal methods, one traditionally uses operational models (e.g. 
Kripke structures [1]), which have explicit notions of state 
and transitions. However, one may also model a system 
axiomatically, where instead of a state-transition relation, the 
system is specified entirely by a set of axioms (e.g., invariants) 
that it maintains. Executions that obey the axioms are allowed, 
and those that violate one or more axioms are forbidden. The 
vast majority of formal methods works use the operational 
modelling style. However, axiomatic models have been used to 
great effect in certain domains such as memory models, where 
they have shown order-of-magnitude improvements in verifi- 
cation performance over equivalent operational models [2]. 

Operational and axiomatic models each have their own 
advantages and disadvantages [3]. Operational models can be 
more intuitive as they typically resemble the system that they 
are modelling. Hence one is not required to reason about 
invariants to write the model. On the other hand, axiomatic 
models tend to be more concise and potentially offer faster 
verification [2]. 

Many formal methods (e.g., refinement procedures [4], 
invariant synthesis, IC3/PDR [5], [6]) are set up to use op- 
erational models. Axiomatic models are largely or completely 
incompatible with these techniques, as the axioms constrain 
full traces rather than a step of the transition relation. One way 
to take advantage of these techniques when using axiomatic 
models is to create and use operational models equivalent to
the axiomatic models. The only prior method of doing this 
was to first manually create the operational model and then 
manually prove it equivalent to the axiomatic model. There 
have been several works doing so [2], [7], [8], [9], [10]. 

Manually creating an operational model and proving equiv- 
alence is cumbersome and error-prone. The ability to auto- 
matically generate operational models equivalent to a given 
axiomatic model would be beneficial, eliminating both the time 
spent creating the operational model as well as the need for 
tedious manual equivalence proofs. Generated models can then 
be fed into techniques currently requiring operational models 
(e.g. IC3/PDR). 

To this end, we make advances in this paper towards 
the automatic conversion of axiomatic models to equivalent 
operational models, on both theoretical and practical fronts. 
In our work, we focus specifically on μspec [11], a well-known
axiomatic framework for modelling microarchitectural
orderings, which has been used in a wide range of contexts
[12], [13], [14], [15], [16] including memory consistency,
cache coherence and hardware security.

On the theoretical front, we show that it is impossible
to convert general μspec axioms to equivalent finite-state
operational models. However, we show that it is feasible to
generate equivalent operational models for a specific subset of
μspec (henceforth referred to as μspecRE). On the practical
side, we develop a method to automatically translate universal
axioms¹ in μspecRE into equivalent finite-state operational
models composed of building blocks that we term axiom
automata (finite automata that monitor whether an axiom has
been violated). Furthermore, for arbitrary μspec axioms, our
method can generate operational models that are equivalent to
the axioms up to a program-size bound.

To evaluate our technique, we convert axioms for three RTL
designs to their corresponding operational models: an in-order
multicore processor (multi_vscale), a memory controller
(sdram_ctrl), and an out-of-order single-core processor
(tomasulo). We showcase how the generated models can
be used with procedures like BMC and IC3/PDR, which are
usually inaccessible for axiomatic models, and we produce
both bounded and unbounded proofs of correctness.

Overall, the contributions of this work are as follows: 


• We prove that generation of equivalent finite-state operational
  models for arbitrary μspec axioms is impossible.


¹ Axioms that do not contain ∃ quantifiers.
[Figure 1 contents: refinability (Def. 4), t-reordering bound (Def. 6), and extensibility (Def. 7) lead to finite-state operational models.]


Fig. 1: Roadmap to obtain finite-state operational models.


• We provide a procedure for generating equivalent finite-state
  operational models for universal axioms in μspecRE.

• We propose the axiom-automata formulation to generate
  equivalent finite-state operational models from universal
  axioms in μspecRE (or from arbitrary μspec axioms if
  only guaranteeing equivalence up to a bounded program
  size).

• We evaluate our method for operational model generation
  by using our generated models to prove the
  (bounded/unbounded) correctness of ordering properties
  on three RTL designs: multi_vscale, tomasulo,
  and sdram_ctrl.


Generality. While axiomatic models enforce constraints 
over complete executions, operational models do this local 
to each transition. Ensuring that behaviours generated by the 
latter are also allowed by the former requires performing non- 
local consistency checks which are hard to reason about, es- 
pecially for unbounded executions. This has been observed in 
manual operationalization works as well. Taking the example
of [7] (which operationalizes C11), we address issues of
eliminating consistent executions too early [7, §3] and repeatedly
checking consistency [7, §4] by developing concepts such as
t-reordering boundedness (Def. 6) and extensibility (Def. 7).
Though we focus on μspec, we believe many of the underlying
challenges and concepts carry over to frameworks such as Cat
[2].

Outline. §II covers the syntax and semantics of μspec used
in this paper. §III covers the formulation of the space of
operational models we consider. They have finite control-state and
read-only input tapes for the instruction streams (programs)
executed by each core. §IV defines our notions of soundness,
completeness, and equivalence when comparing operational
and axiomatic models. In §V, we show that it is impossible
to synthesise equivalent finite-state operational models from
arbitrary axiomatic models. We develop an underapproximation,
called t-reordering boundedness, that addresses this by
bounding the depth of reorderings possible. In §VI we restrict
μspec further by requiring extensibility (preventing current
events from influencing orderings between previous events).
Restricting μspec by t-reordering boundedness and extensibility
is sufficient to enable the automatic generation of equivalent
finite-state operational models (Thm. 2). §VII describes
our conversion procedure based on axiom automata. §VIII
evaluates our technique by using it to generate operational
models, which are then used for checking properties of RTL
designs. §IX covers related work, and §X concludes, with §XI
suggesting avenues for future work. This paper is accompanied
by an extended version which contains supplementary material
and proofs [17].


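II. μSPEC SYNTAX AND SEMANTICS


A. μspec Syntax


\[
\begin{aligned}
\mathit{AX} &::= \forall i\ \mathit{AX} \mid \exists i\ \mathit{AX} \mid \phi(i_1, \dots, i_m) \\
\phi &::= \phi \land \phi \mid \phi \lor \phi \mid \lnot\phi \mid \langle\mathit{atom}\rangle \\
\langle\mathit{atom}\rangle &::= i_1 <_r i_2 \mid \mathit{hb}(i_1.st,\ i_2.st) \mid P(i_1, \dots) \\
st &::= \mathrm{Fet} \mid \mathrm{Dec} \mid \mathrm{Exe} \mid \mathrm{WB} \mid \cdots
\end{aligned}
\]


Fig. 2: μspec Syntax.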


μspec [11] is a domain-specific language used for specifying
microarchitectural orderings. A μspec model consists
of axioms that enforce first-order constraints over execution
graphs; each axiom quantifies over instructions and is required
to be a sentence (i.e. not have any free variables). Execution
graphs that satisfy the axioms and are acyclic are deemed
valid executions. While ISA-level models [2], [18], [19]
treat single instructions as atomic entities, μspec decomposes
the execution of an instruction into a set of atomic events.
Each instruction i and stage st is associated with an event
i.st. A program execution is viewed as a directed acyclic
graph called a micro-architectural happens-before graph (μhb
graph) [12]. Such a graph for a given program has nodes
corresponding to events of the form i.st for each instruction i in
the program and each stage st prescribed by the model. Edges
in the graph correspond to the happens-before (hb) relation:
hb(e₁, e₂) says that e₁ happened before e₂. Thus, a cyclic μhb
graph corresponds to an impossible scenario where an event
happens before itself, and thus represents an execution that
cannot occur on the microarchitecture.


Fig. 2 specifies μspec syntax. It has three types of atoms:


(i) hb(i₁.st, i₂.st): happens-before predicate
(ii) i₁ <ᵣ i₂: the reference order (typically the program order)
(iii) P(i₁, ...): instruction predicate atoms


Atoms of type (ii) capture the order in which instructions
appear in a given program thread. Atoms of type (iii) are
predicates over instructions which capture instruction properties,
e.g. opcode, source/destination registers. We note that μspec
models in the literature [11] also make use of the NodeExists
predicate, which identifies event nodes that occur in the
execution. We do not model NodeExists in this paper, but our
approach can be augmented to incorporate it (see [17]).


We identify two types of axioms of interest. Universal
axioms are of the form ∀i₁ ··· ∀iₖ φ(i₁, ..., iₖ), and
represent constraints applied symmetrically over all tuples of
instructions in a program. Predicate-free axioms are axioms
that do not have occurrences of predicate (P) atoms. We extend
these terms to an axiomatic semantics if all axioms are of
that type. In this work, our theoretical treatment focuses on
universal semantics. Practically though, some underlying ideas
carry over to arbitrary axioms as we discuss in §VII and §VIII.
ax0: ∀ i1. hb(i1.Exe, i1.Com)

ax1: ∀ i1,i2. (i1 <ᵣ i2 ∧ DepOn(i1,i2))
     ⇒ hb(i1.Exe, i2.Exe)

ax2: ∀ i1,i2. SameCore(i1,i2) ⇒
     (hb(i1.Exe, i2.Exe) ∨ hb(i2.Exe, i1.Exe))

ax3: ∀ i1,i2. i1 <ᵣ i2 ⇒ hb(i1.Com, i2.Com)


Fig. 3: An example axiomatic model. 


B. Illustrative μspec Example


Consider the four axioms in Fig. 3. In the axioms, i1,
i2 are instruction variables and Exe, Com are stage names
(short for execute and commit respectively). The axiom ax0
requires that for each instruction, the execute stage (Exe) of
that instruction must happen before the commit stage (Com).
Intuitively, ax1 says that when i2 follows i1 in program order
and depends on it (captured by the predicate DepOn), i1 should
be executed before i2; ax2 says that the execute events of
instructions on the same core should be totally ordered by hb.
The last axiom, ax3, says that when i1 and i2 are in program
order (denoted by <ᵣ), i1 must be committed before i2.

Fig. 4 shows valid and invalid execution graphs for the
program snippet in Fig. 5. The snippet is of a 2-core program,
with two instructions per core. Instruction i₁ is dependent on
the result of i₀ (since its source register is the same as the
destination of i₀). In the example axiomatic semantics, ax1
requires that the execute event of instruction i₀ be before that
of i₁. The execution in Fig. 4b is invalid w.r.t. ax1 since
i₁.Exe is executed before i₀.Exe. The execution in Fig. 4a is
valid even though the i₂.Exe and i₃.Exe events are reordered,
since i₃ does not depend on i₂. Both executions are valid w.r.t.
ax0, ax2 and ax3.


C. Programming Model 


We consider multi-core systems with each core executing 
a straight-line program over a finite domain of operations. 
This is common in memory models [2], [12], [16], [20] and 
distributed systems [21] literature. 

1) Cores: The system consists of n processor cores:
Cores = [n]. Each core executes operations from a finite set
O. The axiomatic model A assigns predicates from P an
interpretation over the universe O. We denote this interpretation
as Pᴬ ⊆ Oᵏ for an arity-k predicate.

2) Instruction streams: An instruction stream ℐ is a word
over O: ℐ ∈ O*. A program P is a set of per-core instruction
streams: {ℐ_c}_{c ∈ Cores}. For a core c and label 0 ≤ j < |ℐ_c|,
we call the triple (c, j, ℐ_c[j]) an instruction². We denote the
components of an instruction i = (c, j, ℐ_c[j]) as: core c(i) = c, label
λ(i) = j and operation op(i) = ℐ_c[j]. The set of instructions
occurring in P is instrsOf(P) = {(c, j, ℐ_c[j]) | c ∈ Cores,
0 ≤ j < |ℐ_c|}, and the set of all possible instructions
is I = Cores × ℤ≥0 × O.


² Note the terminology: operations are commands that the core can execute.
Since we interpret predicates over O, we require O to be a finite set for
computability reasons. Instructions are operations combined with the label
and core identifier (and hence form an infinite set).

3) Instruction stages: Instruction execution in μspec is
decomposed into stages. The set of stages, Stages, is a parameter
of the semantics. Instruction i performing in stage st (i.e. i.st)
is an atomic event in an execution. The execution of P is
composed of the set of events: eventsOf(P) = {i.st | i ∈
instrsOf(P), st ∈ Stages}. The set of all possible events is
𝔼 = {i.st | i ∈ I, st ∈ Stages}.


Definition 1 (Event). An event e is of the form i.st. It
represents the instruction i ∈ I (atomically) performing in
stage st ∈ Stages.


Example 1. Following the example in Fig. 5 we consider
an architecture with two opcodes: add, lw for add and load
respectively. For each of these, we may have several actual
operations (with different operands), thus giving us the set O.
The program P in Fig. 5 has two cores: Cores = {c₀, c₁}
and four instructions: instrsOf(P) = {i₀, i₁, i₂, i₃}. We have,
for example, c(i₁) = c₀, λ(i₁) = 1, op(i₁) = add r3, r2, r1
while c(i₂) = c₁, λ(i₂) = 0. The instruction stream for core
c₀ is ℐ₀ = i₀ · i₁ and that of core c₁ is ℐ₁ = i₂ · i₃.

Let us suppose that this program is executed on a 5-stage
microarchitecture with Stages = {Fet, Dec, Exe, WB, Com}.
The events corresponding to the program are given
by eventsOf(P) = {i₀.Fet, i₀.Dec, ..., i₃.Com} with
|eventsOf(P)| = 4 × 5 = 20.
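
The sets in Example 1 can be enumerated directly. The following is a minimal Python sketch, not part of the paper's tooling: the dictionary layout and the helper names `instrs_of` and `events_of` are hypothetical, and operations are kept as plain strings.

```python
from itertools import product

# Stage names follow Example 1.
STAGES = ["Fet", "Dec", "Exe", "WB", "Com"]

# Program of Fig. 5: one instruction stream (a list of operations) per core.
# The dict-of-lists encoding is an assumption made for this sketch.
PROGRAM = {
    "c0": ["lw r1, 42(r0)", "add r3, r2, r1"],
    "c1": ["lw r4, 42(r0)", "add r3, r2, r1"],
}

def instrs_of(program):
    """Return the instructions (core, label, operation) occurring in the program."""
    return [(c, j, op) for c, stream in program.items() for j, op in enumerate(stream)]

def events_of(program, stages):
    """Return all events i.st, encoded as (instruction, stage) pairs."""
    return list(product(instrs_of(program), stages))

instrs = instrs_of(PROGRAM)
events = events_of(PROGRAM, STAGES)
print(len(instrs), len(events))  # 4 20
```

Note that the two `add` operations are identical as operations but differ as instructions, since an instruction also carries its core and label.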


D. Formal μspec Semantics


We now define the formal semantics of μspec axioms.


Definition 2 (μhb graph). For a program P, a μhb graph
is a directed acyclic graph, G(V, E), with nodes V =
eventsOf(P) representing events and edges representing the
happens-before relationships, i.e. (e₁, e₂) ∈ E ≡ hb(e₁, e₂).

Validity of μhb graph w.r.t. an axiomatic semantics: Consider
an axiomatic semantics A (i.e. a set of axioms). A μhb
graph G = (V, E) is said to represent a valid execution of
program P under A if it satisfies all the axioms in A. We
denote the validity of a μhb graph G by G ⊨_P A.
Satisfaction w.r.t. an axiom: We first define satisfaction for
the quantifier-free part, starting at the atoms. Let s : I(AX) → I
be an assignment for the symbolic instruction variables
I(AX) in axiom AX.

\[
\begin{aligned}
G \models (i_1 <_r i_2)[s] &\iff c(s(i_1)) = c(s(i_2)) \land \lambda(s(i_1)) < \lambda(s(i_2)) && \text{(i)} \\
G \models P(i_1, \dots, i_m)[s] &\iff (op(s(i_1)), \dots, op(s(i_m))) \in P^{A} && \text{(ii)} \\
G \models hb(i_1.st_1,\ i_2.st_2)[s] &\iff (s(i_1).st_1,\ s(i_2).st_2) \in E^{+} && \text{(iii)}
\end{aligned}
\]

In (i), the reference order <ᵣ relates instructions i₁, i₂
from the same instruction stream if i₁ is before i₂. In (ii)
we extend predicate interpretations Pᴬ (defined over O) to
instructions by taking the op(·) component. Finally, hb atoms
Fig. 4: Valid (a) and invalid (b) execution graphs for the program in Fig. 5 and the axioms in Fig. 3. All edges represent the
hb relation. The red (bold) edge violates ax1.


i0: lw r1, 42(r0)       i2: lw r4, 42(r0)
i1: add r3, r2, r1      i3: add r3, r2, r1


Fig. 5: Example program snippet 


are interpreted as E⁺, i.e. the transitive closure of E, as stated in
(iii). Operators ∧, ∨, ¬ have their usual semantics.

We now define the satisfaction of a (quantified) axiom AX
by a graph G, denoted by G ⊨_P AX.


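\[
\begin{aligned}
G \models_P \phi[s] &\iff G \models \phi[s] && \text{for quantifier-free } \phi \\
G \models_P \forall i\ \phi[s] &\iff G \models_P \phi[s[i \leftarrow i']] && \text{for all } i' \in \mathit{instrsOf}(P) \setminus \mathit{range}(s) \\
G \models_P \exists i\ \phi[s] &\iff G \models_P \phi[s[i \leftarrow i']] && \text{for some } i' \in \mathit{instrsOf}(P) \setminus \mathit{range}(s)
\end{aligned}
\]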


The base case is G ⊨_P φ[s] (where φ is quantifier-free)
and follows the earlier definitions. We extend G ⊨_P φ with
(almost) usual quantification semantics: ∀ (∃) quantifies over
all (some) instructions in instrsOf(P), except that a quantified
variable ranges only over instructions not already in range(s).
Execution G is a valid execution of P under semantics A,
denoted as G ⊨_P A, if G ⊨_P AX for all axioms AX in A.
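
As an illustration of these satisfaction rules, the following Python sketch checks the four universal axioms of Fig. 3 on two small graphs. It is a brute-force check written for this example only. Several pieces are assumptions rather than details from the paper: operations are encoded as (opcode, destination, sources) triples, DepOn is read as "the second instruction reads the register the first one writes", only the Exe and Com stages are modelled, and the two edge sets are chosen to match the textual description of Fig. 4 rather than copied from the figure.

```python
from itertools import permutations

# Instructions are (core, label, operation); an operation is
# (opcode, destination register, source registers). Hypothetical encoding.
I0 = ("c0", 0, ("lw", "r1", ("r0",)))
I1 = ("c0", 1, ("add", "r3", ("r2", "r1")))
I2 = ("c1", 0, ("lw", "r4", ("r0",)))
I3 = ("c1", 1, ("add", "r3", ("r2", "r1")))
INSTRS = [I0, I1, I2, I3]

def transitive_closure(edges):
    """Compute E+ by adding composed edges until a fixed point is reached."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def ref_order(i1, i2):
    """Rule (i): same core and strictly smaller label."""
    return i1[0] == i2[0] and i1[1] < i2[1]

def dep_on(i1, i2):
    """Assumed reading of DepOn: i2 reads the register that i1 writes."""
    return i1[2][1] in i2[2][2]

def same_core(i1, i2):
    return i1[0] == i2[0]

def forall(k, body):
    """Universal quantification over k-tuples of pairwise distinct instructions."""
    return all(body(*t) for t in permutations(INSTRS, k))

def check(edges):
    """Return acyclicity and the truth value of each axiom of Fig. 3."""
    hb_plus = transitive_closure(edges)
    hb = lambda i1, s1, i2, s2: ((i1, s1), (i2, s2)) in hb_plus  # rule (iii)
    return {
        "acyclic": all(a != b for a, b in hb_plus),
        "ax0": forall(1, lambda i: hb(i, "Exe", i, "Com")),
        "ax1": forall(2, lambda a, b: not (ref_order(a, b) and dep_on(a, b))
                      or hb(a, "Exe", b, "Exe")),
        "ax2": forall(2, lambda a, b: not same_core(a, b)
                      or hb(a, "Exe", b, "Exe") or hb(b, "Exe", a, "Exe")),
        "ax3": forall(2, lambda a, b: not ref_order(a, b) or hb(a, "Com", b, "Com")),
    }

# Edges shared by both executions (assumed, following the description of Fig. 4).
COMMON = {((i, "Exe"), (i, "Com")) for i in INSTRS} | {
    ((I0, "Com"), (I1, "Com")), ((I2, "Com"), (I3, "Com")), ((I3, "Exe"), (I2, "Exe")),
}
good = COMMON | {((I0, "Exe"), (I1, "Exe"))}  # i0 executes before i1
bad = COMMON | {((I1, "Exe"), (I0, "Exe"))}   # i1 executes before i0

print(check(good))
print(check(bad))
```

Under these assumptions the first graph satisfies every axiom, while the second violates only ax1. Using `permutations` to enumerate tuples mirrors the restriction that a quantified variable ranges over instructions not already in range(s).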


III. OPERATIONAL MODEL OF COMPUTATION 


To concretize our claims, we introduce a model of com- 
putation that characterizes the models of interest. We choose 
to focus on finite-state operational models that generate to- 
tally ordered traces, where transitions represent (i.st) events. 
While there are less restrictive models (e.g. event structures 
[22], [23]), such models require specialized, typically under- 
approximate, verification techniques (e.g. [24], [25], [26]). 
Our choice is motivated by the ability to (a) have finite-state 
implementations of generated models (e.g. in RTL) and (b) 
verify against these models with off-the-shelf tools (e.g. model 
checkers using BDD and SMT-based backends). 


A. Model of computation 


Intuitively, the model of computation resembles a 1-way
transducer [27], [28] with multiple (read-only) input tapes (one
tape for each instruction stream). This allows us to execute
programs of unbounded length with a finite control state.³

1) Model definition: An operational model is parameterized
by cores Cores, stages Stages, and a history parameter h ∈
ℕ ∪ {∞} which bounds the length of tape to the left of the
head. It is a tuple (Q, Δ, q_init, q_final):


• Q is a finite set of control states
• Δ ⊆ Q × (I ∪ {⊥})^|Cores| × Q × Act is the transition
  relation, where Act is the set of actions
• q_init ∈ Q is the initial state
• q_final ∈ Q is the final state, which must be absorbing (i.e.
  it has a self-loop)

A model is finite-state if Q is finite, and it has bounded
history if h ∈ ℕ. For the end goal of effective verification,
we are interested in finite-state, bounded-history models since
it is precisely such models that can be compiled to finite-state
systems.


³ A Kripke structure-based formalism is insufficient since we want to
execute unbounded programs with distinguished instructions without explicitly
modelling control logic.

2) Model semantics: A configuration is a triple γ =
(U, q, V) where U : Cores → I*, V : Cores → I* and
q ∈ Q. Intuitively, U (V) represents, for each instruction stream,
the contents of the input tape to the left (right) of the head.
For a bounded-history machine, a configuration is
allowed only if |U(c)| ≤ h for all c ∈ Cores. For unbounded
history, all configurations are allowed.

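The set of actions is

\[
\begin{aligned}
\mathit{Act} = {} & \{\mathit{right}(c) \mid c \in \mathit{Cores}\} \cup \{\mathit{stay}\} \cup {} \\
& \{\mathit{sched}(c, i, st) \mid c \in \mathit{Cores},\ st \in \mathit{Stages},\ i \in [h]\} \cup {} \\
& \{\mathit{drop}(c, i) \mid c \in \mathit{Cores},\ i \in [h]\}
\end{aligned}
\]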


Intuitively, these represent, in order: motion of the tape head
for c to the right, a silent (no-effect) step, generation of an event,
and removing the i-th instruction from the left of the head. We
provide full semantics in the supplementary material [17].

For a word w ∈ I*, let fst(w) denote its first element if
w ≠ ε and ⊥ otherwise. Transitions are enabled based on
the control state and the instructions that the tape heads point
to: a transition (q₁, (i₁, ..., i_|Cores|), q₂, _) ∈ Δ is enabled in
configuration γ = (U, q, V) if q₁ = q and fst(V(c)) = i_c for
each c ∈ Cores.
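
The enabledness condition can be written down in a few lines. The sketch below is illustrative only: the tuple layout of a transition, the `BOTTOM` marker for an empty tape, and the string-based instructions are assumptions of this example, and it does not model the effect of the actions, whose full semantics are given in the supplementary material [17].

```python
BOTTOM = None  # marker returned for an empty word (no instruction under the head)

def fst(word):
    """First element of a word, or BOTTOM if the word is empty."""
    return word[0] if word else BOTTOM

def enabled(transition, state, right_tapes, cores):
    """A transition (q1, heads, q2, action) is enabled if the control state
    equals q1 and every tape head points at the instruction listed in heads."""
    q1, heads, _q2, _action = transition
    return q1 == state and all(
        fst(right_tapes[c]) == head for c, head in zip(cores, heads)
    )

cores = ["c0", "c1"]
# Right-of-head tape contents per core; core c1 has consumed its whole stream.
V = {"c0": [("c0", 0, "lw"), ("c0", 1, "add")], "c1": []}
t = ("q0", (("c0", 0, "lw"), BOTTOM), "q1", ("right", "c0"))

print(enabled(t, "q0", V, cores))  # True
print(enabled(t, "q1", V, cores))  # False
```

Only the control state and the tape contents to the right of each head are consulted, which is why the left-of-head contents U do not appear as an argument.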

3) Runs: The initial configuration is given by γ_init(P) =
(U_init, q_init, V_init) where U_init = λc. ε and V_init = λc. ℐ_c, i.e.
for each core, the left of the tape head is empty, and the right
of the tape head consists of the instruction stream for that core.
Starting from γ_init(P), the machine transitions according to the
transition rules. Such a sequence of configurations γ_init(P) =
γ₀ → γ₁ → ··· → γ_m, where all γ_i are allowed, is called a run.
A run is called accepting if it ends in the state q_final.

4) Traces: The sequence of event labels σ = e₁ ··· e_m
annotating a run is the trace corresponding to the run. Each
label is an event from 𝔼 and hence σ ∈ 𝔼*. We view σ as
a (linear) μhb execution graph e₁ → e₂ → ··· → e_m, and
hence define σ ⊨_P A in the usual way. Accordingly, we will
sometimes refer to σ as an execution of a program P. The set
of traces corresponding to accepting runs of an operational
model M on a program P is denoted as traces_M(P) ⊆ 𝔼*.


IV. SOUNDNESS, COMPLETENESS, AND EQUIVALENCE 


We proceed to formalize the notion of equivalence that
relates axiomatic and operational models. In the literature [29],
[2], ISA-level behaviours of programs have been annotated
by the read values of load operations. Hence, one notion of
equivalence might be to require that identical read values be
possible between the models. While this may be reasonable for
ISA-level behaviours, it can hide microarchitectural features:
different microarchitectural executions can have identical
architectural results. Given that μspec models executions at the
granularity of microarchitectural events, we adopt a stronger
notion of equivalence. For soundness, we require that the
operational semantics generates linearizations of μhb graphs
that are valid under the axiomatic semantics. Formally:


Definition 3 (Soundness). An operational model M is sound
w.r.t. A if for any program P, each trace in traces_M(P) is a
linearization of some μhb graph that is valid under A.


Before defining completeness, we need to address a subtlety.
Since operational executions are viewed as μhb graphs by
interpreting trace-ordering as the hb ordering, the operational
model always generates linearized μhb graphs. However, in
general, linearizations of valid μhb graphs could end up being
invalid w.r.t. the axioms. Consider Example 2.


Example 2 (Non-refinable axiom). For the following axiom
with Stages = {S}, the graph (a) is a valid execution. However,
both of its linearizations (b) and (c) are invalid. Thus,
all of the (totally-ordered) traces generated by our operational
models will be deemed invalid under the axiomatic semantics.
This renders a direct comparison between operational and
axiomatic executions infeasible.


∀ i1,i2. (¬hb(i1.S, i2.S) ∧ ¬hb(i2.S, i1.S))


To address this issue, we develop the notion of refinability.
For two μhb graphs G = (V, E) and G′ = (V′, E′), we say
that G′ refines G, denoted G ⊑ G′, if (1) V = V′ and (2)
(e₁, e₂) ∈ E ⟹ (e₁, e₂) ∈ E′⁺.


Definition 4 (Refinable hb). An axiomatic semantics A is
refinable if for any program P and μhb graph G s.t. G ⊨_P A,
we have G′ ⊨_P A for all linear graphs G′ satisfying G ⊑ G′.


Refinability says that all linearizations of a valid graph are
valid. While executions under axiomatic semantics are given
by (partially-ordered) μhb graphs, our class of operational
models generates totally-ordered traces. Refinability bridges
this gap by relating valid μhb graphs to valid traces. Interestingly,
we can check whether a universal axiomatic semantics
satisfies refinability, which at a high level, we show via a small
model property (Lemma 1).
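
The failure of refinability in Example 2 can be reproduced by brute force. The Python sketch below enumerates the linearizations of a two-event graph and evaluates the Example 2 axiom on each. It assumes that graph (a) is the graph with two S events and no hb edge, which is the only way two events can satisfy the axiom; the helper names are hypothetical. It checks a single graph and is not the decision procedure behind Lemma 1.

```python
from itertools import permutations

def transitive_closure(edges):
    """Compute E+ by adding composed edges until a fixed point is reached."""
    closure = set(edges)
    while True:
        new = {(a, d) for (a, b) in closure for (c, d) in closure if b == c}
        if new <= closure:
            return closure
        closure |= new

def satisfies_example2(instrs, edges):
    """Axiom of Example 2: no two distinct instructions have hb-ordered S events."""
    hb = transitive_closure(edges)
    return all(
        ((a, "S"), (b, "S")) not in hb and ((b, "S"), (a, "S")) not in hb
        for a, b in permutations(instrs, 2)
    )

def linearizations(nodes, edges):
    """Yield each total order of the nodes that respects the edges, as chain edges."""
    hb = transitive_closure(edges)
    for order in permutations(nodes):
        pos = {e: k for k, e in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in hb):
            yield set(zip(order, order[1:]))

instrs = ["i1", "i2"]
nodes = [(i, "S") for i in instrs]
unordered = set()  # assumed graph (a): two events, no hb edge

print(satisfies_example2(instrs, unordered))  # True
print([satisfies_example2(instrs, lin) for lin in linearizations(nodes, unordered)])
# [False, False]
```

The unordered graph is valid while both of its linearizations are invalid, so this semantics is not refinable in the sense of Definition 4.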


Lemma 1. Given a universal axiomatic semantics we can 
decide whether the semantics is refinable. 


Refinability is especially important for completeness. For 
non-refinable semantics, validity of linearizations cannot be 
checked based on the axioms, as all linearizations may be 
invalid (Example 2). 

We assume that the axiomatic semantics satisfies
refinability.

We now define completeness and our formal problem statement.

Definition 5 (Completeness). An operational model M is
complete if, for any program P and valid μhb graph G ⊨_P A,
traces_M(P) contains all linearizations of G.


Formal Problem Statement. Given an axiomatic semantics
A, a set of cores Cores and stages Stages, generate a finite-state,
bounded-history model M = (Q, Δ, q_init, q_final) which
satisfies soundness and completeness (Defns. 3 and 5).


V. ENABLING SYNTHESIS BY BOUNDING REORDERINGS 


In this section, we develop some theoretical results for 
the synthesis of operational models. First, we show that 
synthesis of sound and complete (viz. Defn. 3 and 5) finite- 
state operational models is not possible. Then we provide an 
underapproximation for the completeness requirement, called 
t-completeness, that enables the synthesis of finite-state mod- 
els. This still does not allow for bounded-history models as 
future events can influence past orderings (Example 3). In §VI 
we add extensibility thus enabling our original goal of finite- 
state and bounded-history models. 


A. An impossibility result 


We show that it is in fact impossible to develop a finite-state
transition system M that satisfies the requirements prescribed
in Defns. 3 and 5. Figure 6 gives an axiomatic semantics A*
(with Stages = {S, T}) such that for all possible finite-state
models, there is some program such that either soundness or
completeness is violated. In words, the axioms in Fig. 6 state


ax0: ∀ i1. hb(i1.S, i1.T)
ax1: ∀ i1,i2. hb(i1.S, i2.S) ⇒ hb(i1.T, i2.T)


Fig. 6: Semantics A* that does not allow bounded synthesis 


the following constraints: ax0 says that for each instruction, 
the S stage event happens before the T stage, and ax1 enforces 
that for any two instructions, the ordering between their S 
stage events implies an identical ordering between their T stage 
events. We have the following: 


Theorem 1. For a single-core program P with an instruction
stream of |ℐ_c| = m instructions, there is no model M =
(Q, Δ, q_init, q_final) that is sound and complete w.r.t. A* and
P, and s.t. |Q| < O(2^m/m), even with h = ∞.


We provide an intuitive explanation, deferring details to the
supplement [17]. In valid executions of A*, S stage events
can be ordered arbitrarily, while T stage events must maintain
the same ordering as that of corresponding S stages. Hence the
machine must remember the S orderings in its finite control.
However, the number of such orderings grows (exponentially)
with the number of instructions m, implying that no single
finite-state model can work for all programs.
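
The growth in the number of S orderings can be observed on small instances. The following Python sketch enumerates all totally ordered traces of a single-core program with m instructions and stages S and T, keeps those valid under the two axioms of Fig. 6, and collects the distinct S orderings they realise. It illustrates the intuition only and is not the counting argument used in the proof of Theorem 1; integer instruction identifiers and the function names are assumptions of this example.

```python
from itertools import permutations

def valid_under_fig6(trace, m):
    """Check ax0 and ax1 of Fig. 6 on a linear trace (hb is the trace order)."""
    pos = {e: k for k, e in enumerate(trace)}
    ax0 = all(pos[(i, "S")] < pos[(i, "T")] for i in range(m))
    ax1 = all(
        pos[(i, "T")] < pos[(j, "T")]
        for i, j in permutations(range(m), 2)
        if pos[(i, "S")] < pos[(j, "S")]
    )
    return ax0 and ax1

def s_orderings(m):
    """Distinct orders of the S events over all valid traces of m instructions."""
    events = [(i, st) for i in range(m) for st in ("S", "T")]
    seen = set()
    for trace in permutations(events):
        if valid_under_fig6(trace, m):
            seen.add(tuple(i for i, st in trace if st == "S"))
    return seen

for m in (1, 2, 3):
    print(m, len(s_orderings(m)))
# 1 1
# 2 2
# 3 6
```

Every permutation of the S events is realised by some valid trace, and the T events must then replay that permutation, which is the information a sound and complete machine would need to retain.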


Corollary 1. There does not exist a finite-state operational
model (even with h = ∞) which is sound and complete with
respect to the A* axioms.


B. An underapproximation result 


Given the results of the previous section, we must re- 
lax some constraint imposed on the operationalization: we 
choose to relax completeness. To do so, we define an under- 
approximation called t-reordering bounded traces. Intuitively, 
this imposes two constraints: (a) it bounds the depth of 
reorderings between instructions on each core, (b) it bounds 
the number of instructions executed on all other cores, while 
a core is executing a single instruction. 

We observe that (a) is a reasonable assumption since most 
microarchitectures bound reordering depth, often due to finite 
reorder buffers. On the other hand, (b) can be thought of as a 
fairness/starvation-freedom property. 

For two instructions i₁, i₂ on the same core, let
diffᵣ(i₁, i₂) = λ(i₂) − λ(i₁) (recall that λ(i) is the instruction
index of i). Consider a trace σ of program P. For
i ∈ instrsOf(P), we define the starting index of i, denoted
as start(i), as the index of the first event of instruction i in σ.
Similarly we define the ending index, end(i), as the largest
index for some event of i in σ. Let the prefix-closed end
index of i be the max of end over instructions that are ≤ᵣ i:
pfxend(i) = max{end(i′) | i′ ≤ᵣ i}. Two instructions i₁ and i₂
are coupled in a trace (denoted as coup(i₁, i₂)) if the intervals
[start(i₁), pfxend(i₁)] and [start(i₂), pfxend(i₂)] overlap.


Definition 6 (t-reordering bounded traces). A trace is
t-reordering bounded if, for any pair of instructions i_1, i_2 with
c(i_1) = c(i_2), (1) if i_2.st_2 > i_1.st_1 then diff_r(i_1, i_2) < t and
(2) if coup(i_1, i), coup(i, i_2) for some i then |diff_r(i_1, i_2)| < t.


Intuitively, (1) says that an instruction cannot be reordered 
with another that precedes it by > t indices, while (2) says 
that instructions on a core cannot be stalled while more than 
t instructions are executed on another. Note that t-reordering 
boundedness is a property of traces, and not of axioms. We 
now relax completeness (and hence equivalence) to require 
that the operational model at least generate all t-reordering 
bounded linearizations (instead of all linearizations). 
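
The coupling relation and Definition 6 can be made concrete with a small checker. The sketch below is illustrative only: the encoding of instructions as (core, index) pairs, the function names, and the reading of condition (1) as "some event of i_2 occurs in the trace before some event of i_1" are assumptions made for this example, not the paper's implementation.

```python
from itertools import permutations

# Hypothetical encoding: an instruction is (core, index) and a trace is a
# list of (instruction, stage) events in execution order.

def spans(trace):
    """First and last trace index of every instruction (start, end)."""
    start, end = {}, {}
    for pos, (instr, _stage) in enumerate(trace):
        start.setdefault(instr, pos)
        end[instr] = pos
    return start, end

def pfxend(i, end):
    """Largest end index among i and its program-order predecessors."""
    core, idx = i
    return max(e for (c, k), e in end.items() if c == core and k <= idx)

def coupled(i1, i2, start, end):
    """coup(i1, i2): the [start, pfxend] intervals of i1 and i2 overlap."""
    return start[i1] <= pfxend(i2, end) and start[i2] <= pfxend(i1, end)

def is_t_bounded(trace, t):
    start, end = spans(trace)
    instrs = list(start)
    for i1, i2 in permutations(instrs, 2):
        if i1[0] != i2[0]:
            continue  # Definition 6 only constrains same-core pairs
        diff = i2[1] - i1[1]
        # (1), assumed reading: an event of i2 precedes an event of i1.
        if start[i2] < end[i1] and not diff < t:
            return False
        # (2): i1 and i2 are linked through some instruction i.
        linked = any(coupled(i1, i, start, end) and coupled(i, i2, start, end)
                     for i in instrs)
        if linked and not abs(diff) < t:
            return False
    return True

a, b, c = (0, 0), (0, 1), (0, 2)
in_order = [(a, "S"), (b, "S"), (c, "S")]
reordered = [(c, "S"), (a, "S"), (b, "S")]
```

Under these assumptions, the in-order trace is accepted for any t ≥ 1, while the trace that schedules the third instruction first is only accepted once t exceeds the index distance of 2.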


Definition 5* (t-completeness). An operational model M is
t-complete w.r.t. an axiomatic model A if, for each program P
and G ⊨_P A, traces_M(P) contains all t-reordering bounded
linearizations of G.


Replacing Defn. 5 with its t-bounded relaxation (Defn. 5*)
addresses the issue of having to keep track of an unbounded
number of orderings. However, to allow for finite implementations
in practice, in addition to finite-state, we also require
bounded-history (h ∈ ℕ). This is addressed in the next section.

Fig. 7: P has three instruction streams. Blue instructions form
the prefix P' (i.e. P' ⪯ P) and red its residual P'' = P ⊖ P'.
The figure shows executions G' of P', G'' of P'', and their
composition G = G' ▷ G''.


VI. ADDING EXTENSIBILITY 


As illustrated by the following example, the t-reordering 
bounded underapproximation is insufficient to achieve 
bounded-history operational model synthesis on its own. 


Example 3 (Need for extensibility). Consider a single-stage
axiomatic semantics with Stages = {S} and a single predicate P.


∀ i0, i1, i2. (P(i0,i1,i2) ∧ i0 <_r i1) ⇒ ¬hb(i1.S, i0.S)


There cannot be a sound, t-complete, and bounded-history
(for bound h) model for this axiom (for some t > 1). To
see this, consider a (single-core) program with instructions
i_0 · i_1 ⋯ i_{h+1}. Depending on the instructions in the program,
the interpretation of the predicate P can either be
(a) {(i_0, i_1, i_{h+1})} or (b) {}. In the former case, the ordering
i_1.S → i_0.S is invalid, while in the latter it is valid. Since we
only allow an h-sized history, i_0.S must be scheduled before the
tape-head reaches i_{h+1}, i.e. before the machine can determine
which of (a)/(b) holds. Since the machine cannot determine
whether events i_0.S, i_1.S can be reordered, this leads either
to a model which is unsound (always reorders) or incomplete
(never reorders).


Thus, we need an additional restriction to enable generation 
of operational models with a finite history parameter h. 
We propose extensibility, which intuitively states that partial 
executions of program P that have not violated any axioms 
can be composed with valid executions of the residual program 
to generate valid complete executions of P. To do this, we 
extend the notion of validity to partial executions through 
prefix programs. 

A program P can be split into a prefix P' (blue) and the
residual suffix P'' (red) (Fig. 7). Formally, P' is a prefix of
program P if P' has instruction streams {I'_c}, each of which
is a prefix of the corresponding instr. stream in {I_c} of P. We
denote that P' is a prefix of P by P' ⪯ P. For programs P, P' such
that P' ⪯ P, we denote the residual of P w.r.t. P' as P'' = P ⊖ P'.
P'' has instr. streams I''_c: for each core c, I_c = I'_c · I''_c.
In Fig. 7, for example, the first instruction stream of P has four
instructions. The prefix program P' has the first three of them as
its first instr. stream. On the other hand, the residual program,
P'' = P ⊖ P', has the remaining instruction as its instruction stream.

For graphs G' = (V', E') and G'' = (V'', E''), with V' ∩ V'' = ∅,
we define G' ▷ G'' as the graph G = (V, E) where
(1) V = V' ∪ V'', and (2) E = E' ∪ E'' ∪ {(e', e'') | e' ∈
sink(E'), e'' ∈ source(E'')}. The example in Fig. 7 illustrates
such a composition: we have G = G' ▷ G''.
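
As a minimal sketch of this composition, the code below represents a graph as a pair of Python sets (vertices, edges). The helper names and the choice to treat every vertex without outgoing (resp. incoming) edges as a sink (resp. source), including isolated vertices, are assumptions made for illustration.

```python
def sinks(vertices, edges):
    """Vertices with no outgoing edge."""
    return {v for v in vertices if all(src != v for src, _dst in edges)}

def sources(vertices, edges):
    """Vertices with no incoming edge."""
    return {v for v in vertices if all(dst != v for _src, dst in edges)}

def compose(g1, g2):
    """Build the composition of g1 and g2: keep all edges of both graphs
    and order every sink of g1 before every source of g2."""
    (v1, e1), (v2, e2) = g1, g2
    assert not (v1 & v2), "vertex sets must be disjoint"
    bridge = {(a, b) for a in sinks(v1, e1) for b in sources(v2, e2)}
    return v1 | v2, e1 | e2 | bridge

# Execution of a one-instruction prefix and of a one-instruction residual.
g_prefix = ({"i0.S", "i0.T"}, {("i0.S", "i0.T")})
g_resid = ({"i1.S", "i1.T"}, {("i1.S", "i1.T")})
vertices, edges = compose(g_prefix, g_resid)
```

Here the only added edge goes from the last event of the prefix execution to the first event of the residual execution, so every event of the prefix is ordered before every event of the residual.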


Definition 7 (Extensibility). An axiom AX satisfies extensibility
if for any programs P and P' s.t. P' ⪯ P, and P'' = P ⊖ P',
if G' ⊨_{P'} AX and G'' ⊨_{P''} AX then G' ▷ G'' ⊨_P AX.
An axiomatic semantics A satisfies extensibility if all axioms
AX ∈ A satisfy extensibility.


We require that the axiomatic model satisfies extensibility.
We define μspecRE (RE stands for Refinable, Extensible) as
the subset of μspec in which all axioms are refinable and
extensible. Finite-state, bounded-history synthesis is feasible
for universal axioms in μspecRE, as we discuss in the next
section. Like refinability, we can check whether an axiom
satisfies extensibility (Lemma 2).


Lemma 2. Given a universal axiom we can decide whether 
it satisfies extensibility. 


VII. CONVERTING TO OPERATIONAL MODELS USING 
AXIOM AUTOMATA 


In this section, we describe our approach that converts an
axiomatic model into an equivalent operational model M. In
§VII-A we develop axiom automata, which are the building
blocks of our operationalization: they are automata that check
for axiom compliance as the operational model executes. In
§VII-B we describe how these automata can be instantiated
to ensure validity for bounded programs with arbitrary μspec
axioms. §VII-C holds our main result: we describe how axiom
automata can be instantiated to get a finite-state bounded-history
model for universal axioms in μspecRE.

We focus on a single universal axiom ∀i_1 ⋯ ∀i_k φ, but
this can be easily extended to a set of axioms.


A. Axiom Automata 


In what follows, we fix a (universal) axiom AX =
∀i_1 ⋯ ∀i_k φ(i_1, ⋯, i_k), and let I(AX) = {i_1, ⋯, i_k},
E(AX) = {i.st | i ∈ I(AX), st ∈ Stages}. This axiom
enforces that φ(·) holds for all k-tuples of instructions in
the given program. An axiom automaton is a finite state
automaton that monitors whether φ(·) holds for a single
k-tuple of instructions. Our operational model is composed of
several such automata, thereby allowing us to check all
k-tuples. We now define axiom automata, starting with some
auxiliary definitions.

Let nonhb(AX) denote the non-hb atoms in φ, i.e. instruction
predicate applications and <_r orderings. A context is an
assignment (of true/false) to each atom in nonhb(AX); cxt :
nonhb(AX) → 𝔹. Each variable assignment s : I(AX) → I
fixes the valuation of all nonhb(AX) atoms (following the
semantics in §II). Hence each assignment s leads to a unique
context, which we denote as cxt(s).

We extend assignments to events and words over events.
For e = i.st, we define s(e) = s(i).st and for w ∈ E(AX)*,


s(w) = s(w[0]) ⋯ s(w[|w| − 1]) ∈ E*


As mentioned in §III-A4, we interpret s(w) ∈ E* as the μhb
graph w[0] → w[1] → ⋯ → w[|w| − 1].

Observe that once we fix the context, the validity of φ(·)
only depends on the value of the hb atoms in φ. Hence for two
assignments s_1, s_2 with the same context, cxt(s_1) = cxt(s_2),
s_1 and s_2 share the same set of valid executions: s_1(w)
satisfies φ if and only if s_2(w) does. This implies that across
different assignments s, there are only finitely many valid sets
of executions over events in s(E(AX)): one for each context.
Intuitively, contexts divide the set of all possible assignments
into classes which admit similar orderings.

As a consequence of the above, for each AX and context cxt,
we can construct a finite state automaton that recognizes
acceptable orderings of E(AX) (Lemma 3). The main observation
behind Lemma 3 is that once the context (i.e. interpretation of
the nonhb(AX) atoms) is fixed, the allowed orderings can be
represented as a language over the symbolic events E(AX).


Lemma 3 (Axiom-Automata). Given an axiom AX and context
cxt, there exists a finite-state automaton aa(AX[cxt]) over
alphabet E(AX) with language {w | w ∈ E(AX)^perm, s(w) ⊨
φ(i_1, ⋯, i_k)[s] for all s that agree with cxt}.


B. Deploying axiom automata 


1) Concretization of an axiom automaton: The automaton
aa(AX[cxt]) mentioned in Lemma 3 recognizes orderings over
the symbolic alphabet E(AX) that lead to φ being satisfied.
Our end goal, however, is identifying acceptable orderings
over the (non-symbolic) events E. This requires us to generate
concrete instances of axiom automata, one for each assignment
s : I(AX) → I, which we now do.

Given an assignment s : I(AX) → I, we denote the
(concretized) automaton for s w.r.t. AX as aa(AX, s). The
automaton aa(AX, s) is identical to aa(AX[cxt(s)]), except that
the symbolic alphabet E(AX) is replaced by its image s(E(AX))
under s. Intuitively (by §VII-A), the set of valid orderings
of events in s(E(AX)) is characterized by the context of s,
cxt(s). This means that the set of acceptable orderings of events
in s(E(AX)) is identical to the set of words (orderings) accepted
by aa(AX[cxt(s)]), except that the symbolic events E(AX)
should be replaced by their concrete counterparts, s(E(AX)).
This justifies the definition of aa(AX, s).

We extend the notation aa(AX, s) from a single assignment
to a set of assignments. For J ⊆ I, we denote by aa(AX, J) the
set of axiom automata over J: {aa(AX, s) | s : I(AX) → J}.
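
Concretization amounts to relabeling the alphabet of a symbolic automaton. The following sketch illustrates this with a hand-written two-transition automaton for the context in which i1 precedes i2 in program order and fetches must follow that order. The `DFA` class, the tuple encoding of events, and the assumption that the assignment s is injective are all choices made for this example.

```python
from dataclasses import dataclass

@dataclass
class DFA:
    initial: str
    accepting: set
    delta: dict  # (state, symbol) -> state; a missing entry rejects

    def accepts(self, word):
        state = self.initial
        for symbol in word:
            if (state, symbol) not in self.delta:
                return False
            state = self.delta[(state, symbol)]
        return state in self.accepting

def concretize(aa, s):
    """Replace each symbolic event (var, stage) by (s[var], stage).

    Assumes s is injective, so that distinct symbolic events stay distinct.
    """
    delta = {(q, (s[var], stage)): q_next
             for (q, (var, stage)), q_next in aa.delta.items()}
    return DFA(aa.initial, set(aa.accepting), delta)

# Symbolic automaton: the only accepted ordering is i1.Fet then i2.Fet.
symbolic = DFA(
    initial="q0",
    accepting={"q2"},
    delta={("q0", ("i1", "Fet")): "q1", ("q1", ("i2", "Fet")): "q2"},
)

# Assignment of the axiom's variables to concrete (core, index) instructions.
s = {"i1": (0, 0), "i2": (0, 1)}
concrete = concretize(symbolic, s)
```

The states and transition structure are untouched; only the symbols change, which is why one symbolic automaton per context suffices regardless of how many assignments share that context.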

2) A basic operationalization: Lemma 3 and the concretization
defined in §VII-B1 suggest an operationalization
for AX. For a program P, if a trace σ is accepted by all
(concrete) automata aa(AX, instrsOf(P)) then σ ⊨ φ[s] holds
for each assignment s, thus satisfying AX. The number of
these automata is |aa(AX, instrsOf(P))| ∼ |instrsOf(P)|^k for
an axiom with k universally quantified variables. Since this
increases with P, the model is not finite state. Even so, this
enables us to construct operational models for a given bound on
|instrsOf(P)|. We can do this even for non-universal axioms
by converting existential quantifiers into finite disjunctions
over instrsOf(P). We demonstrate an application of this in
§VIII, where we check that a processor satisfies an axiom
ensuring correctness of read values.

Fig. 8: Completed prefix (pCM), in-progress (IP) and not-fetched
postfix (pNF) of instructions during execution.
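
To illustrate the |instrsOf(P)|^k growth, the sketch below enumerates one check per assignment for a two-variable in-order-fetch axiom of the form i1 <_r i2 ⇒ hb(i1.Fet, i2.Fet). It evaluates the axiom directly on a complete trace rather than running automata, and the (core, index) encoding and function names are assumptions made for illustration.

```python
from itertools import product

def assignments(instrs, k):
    """All assignments of k axiom variables to instructions."""
    return list(product(instrs, repeat=k))

def satisfies_in_order_fetch(trace, instrs):
    """Check the axiom for every assignment (i1, i2).

    Each loop iteration plays the role of one concrete automaton.
    """
    pos = {event: n for n, event in enumerate(trace)}
    for i1, i2 in assignments(instrs, 2):
        # Context of this assignment: are i1, i2 in program order?
        in_program_order = i1[0] == i2[0] and i1[1] < i2[1]
        if in_program_order and not pos[(i1, "Fet")] < pos[(i2, "Fet")]:
            return False
    return True

instrs = [(0, 0), (0, 1), (1, 0), (1, 1)]
good = [((0, 0), "Fet"), ((1, 0), "Fet"), ((0, 1), "Fet"), ((1, 1), "Fet")]
bad = [((0, 1), "Fet"), ((0, 0), "Fet"), ((1, 0), "Fet"), ((1, 1), "Fet")]
```

With four instructions and k = 2 there are 16 assignments, and with six or eight instructions there are 36 or 64, which is the quadratic growth that rules out a finite-state model without a bound on the program.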


C. Bounding the number of active instructions 


As the discussion from §VII-B2 concludes, generating all
concrete automata (statically) for arbitrary μspec specifications
does not give us a finite state model. We need to bound the
number of automata maintained at any point in the trace. In
order to do this, for each index in the trace, we identify active
instructions: an active instruction is one for which we need
to maintain ordering information at that index. We observe
that under the t-bounded reordering under-approximation, only
a bounded number of instructions are active. This, in turn,
implies that we only need to maintain a bounded number of
axiom automata. We now formalize these concepts.

For a t-reordering bounded trace σ of a program P and a
trace index 0 ≤ j < |σ|, let CM(j) and NF(j) be instructions
which have executed all and none of their events at σ[j]
respectively. We define the following auxiliary terms:


pCM(j) = {i | ∀i'. i' ≤_r i ⇒ i' ∈ CM(j)}
pNF(j) = {i | ∀i'. i ≤_r i' ⇒ i' ∈ NF(j)}
IP(j) = instrsOf(P) \ (pCM(j) ∪ pNF(j))


Intuitively, pCM(j) represents the prefix-closed set of
completed instructions, pNF(j) represents the postfix-closed set of
not-fetched instructions, and IP(j) are the rest: the in-progress
instructions (see Fig. 8). By the first condition of t-reordering
boundedness, in-progress (IP) instructions on each core are
bounded by t for all j (Lemma 4):


Lemma 4. For any t-reordering bounded trace σ, for all 0 ≤
j < |σ|, we have |IP(j)| < |Cores| · t.
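
The three sets can be computed per core by scanning for the longest completed prefix and the longest untouched suffix. The sketch below assumes that "at σ[j]" means after the events σ[0..j] have executed, that each instruction has exactly one event per stage, and that instructions are encoded as (core, index); none of these conventions is taken from the paper.

```python
def classify(trace, j, program, stages):
    """Return (pCM, pNF, IP) after the events trace[0..j].

    program maps each core to its number of instructions.
    """
    executed = {}
    for instr, _stage in trace[: j + 1]:
        executed[instr] = executed.get(instr, 0) + 1

    pcm, pnf, ip = set(), set(), set()
    for core, n in program.items():
        # Longest prefix whose instructions have executed every stage.
        p = 0
        while p < n and executed.get((core, p), 0) == len(stages):
            p += 1
        # Longest suffix whose instructions have executed no stage.
        q = n
        while q > p and executed.get((core, q - 1), 0) == 0:
            q -= 1
        pcm |= {(core, k) for k in range(p)}
        pnf |= {(core, k) for k in range(q, n)}
        ip |= {(core, k) for k in range(p, q)}
    return pcm, pnf, ip

stages = ["S", "T"]
program = {0: 3}
trace = [((0, 0), "S"), ((0, 1), "S"), ((0, 0), "T"),
         ((0, 1), "T"), ((0, 2), "S"), ((0, 2), "T")]
```

Note that an instruction that completes ahead of an unfinished predecessor is still counted as in-progress, since pCM is prefix-closed.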


Active instructions. Two instructions i, i' are k-coupled
in a trace σ if they form a coupling chain of length
k: i.e. there exist instructions i_1, ⋯, i_{k−1} such that
coup(i, i_1), coup(i_1, i_2), ⋯, coup(i_{k−1}, i'). For trace σ, 0 ≤
j < |σ| and k ∈ ℕ, we define k-active instructions at j,
AC_k(j), as instructions from pCM(j) ∪ IP(j) which are
k-coupled with some instruction from IP(j).

Intuitively, for a μspecRE axiom with k universally quantified
variables, the executions of two instructions affect each
other only if they are k-coupled. In particular, maintaining
ordering information is important for instructions which are
k-coupled with the in-progress instructions. As Lemma 5 shows,
these active instructions, AC_k(j), are bounded at any given
point in the trace.

Fig. 9: Experimental setup.


Lemma 5. For each k, there is a (program-independent)
bound b_k, s.t. for any t-reordering bounded trace σ, for all
0 ≤ j < |σ|, we have |AC_k(j)| < b_k.


The operational model. Our operational model maintains
the in-progress instructions (IP) on its tape. At each step it
schedules an event from these instructions. The validity of
event scheduling is ensured by maintaining orderings between
events corresponding to the active instructions. Lemmas 4 and 5
imply that at all points in the trace, (1) the set IP is bounded
and (2) the active instructions, AC_k, are bounded (as a
function of b_k). Consequently, this results in a model which
has finite state (used to maintain orderings between events of
AC_k) and bounded history (owing to (1)). This gives us the
main result: a finite-state, bounded-history operational model.


Theorem 2. For a (refinable) universal axiomatic semantics 
that satisfies extensibility, synthesis of finite-state, bounded- 
history operational models satisfying Def. 3 and 5* is feasible. 


VIII. CASE STUDIES 


In this section, we demonstrate applications of operational-
ization. We discuss three case studies: (1) multi_vscale
[30] is a multi-core extension of the 3-stage in-order vscale
[31] processor, (2) tomasulo is an OoO processor based on
[32], and (3) sdram_ctrl is an SDRAM-controller [33].

For each case, we instrument the hardware designs by 
exposing ports that signal the execution of events (e.g. the PC 
ports in Fig. 9). We convert axioms into an operational model 
M based on the approach discussed in §VII. M is compiled 
to RTL and is synchronously composed with the hardware 
design, where it transitions on the exposed event signals. Thus, 
any violating behaviour of the hardware will lead M into a 
non-accepting (bad) state. Hence by specifying !bad as a 
safety property, we can perform verification of the RTL design 
w.r.t. the axioms. The operationalization approach enables us 
to perform both bounded and unbounded verification using 
off-the-shelf hardware model checkers. We highlight that this 
would not have been possible without operationalization. 

We use the Yosys-based [34] SymbiYosys as the model-
checker, with boolector [35] and abc [36] as backend solvers
for BMC and PDR proof strategies respectively. Experiments
are performed on an Intel Core i7 machine with 16GB
of RAM. We use our algorithm to automatically generate
axiom automata. The compilation of the generated automata
to RTL and their instrumentation with the design is done
manually. However, in the future this could be automated
following the procedure developed in §VII. The experimental
designs are available at
https://github.com/adwait/axiomatic-operational-examples.


Instructions   PDR     BMC (d = 20)
ALU-R          1m46s   14m30s
ALU-I          2m11s   11m31s
Load+Store     2m18s   13m35s

Fig. 10: Proof runtimes for (ax1 ∧ ax2).

Highlights. We demonstrate how the operationalization 
framework enables us to leverage off-the-shelf model checking 
tools implementing bounded and (especially) unbounded proof 
techniques such as IC3/PDR. This would not have been 
possible directly with axiomatic models. Even when Thm. 
2 does not apply (e.g. non-universal/non-extensible axioms), 
following §VII-B2 we can fall back on a BMC-based check 
over all possible programs under a bound on |instrsOf(P)|. 


A. The multi_vscale processor 


a) Pipeline axioms on a single core: We begin with the
single-core variant of multi_vscale. We are interested in
verifying the pipeline axioms for this core. The first axiom
states that pipeline stages must be in Fet→DX→WB order and
the second enforces in-order fetch.


ax1: ∀ i1. (hb(i1.Fet, i1.DX) ∧ hb(i1.DX, i1.WB))
ax2: ∀ i1, i2. i1 <_r i2 ⇒ hb(i1.Fet, i2.Fet)

The setup schematic is given in Figure 9: M is the
operational model implemented in RTL (note that we could do
this only because the model is finite state and requires a finite
history h). Given that it is a 3-stage in-order processor, at any
given point each core has at most 3 instructions in its pipeline
and we can safely choose a history parameter of h = 3, and
M is complete for a reordering bound of t = 3. We replace the
imem_hrdata (instruction data) connection to the core by
an input signal that we can symbolically constrain. Using this
input signal, we can control the program (instruction stream)
executed by the core.

Verification is performed with a PDR based proof using the 
abc pdr backend. We experiment with various choices of 
instructions fed to the processor (by symbolically constraining 
imem_hrdata). In Fig. 10, we show the constraint and its 
PDR proof runtime, with BMC runtime (depth = 20) for 
comparison. These examples demonstrate our ability to prove 
unbounded correctness. 

b) Memory ordering on multi-core: We now configure
the design with 2 cores, c_0 and c_1, both initialized with symbolic
load and store operations. We then perform verification w.r.t.
the ReadValues (RV) axiom shown below. This axiom says
that for any read instruction (i1), the value read should be the
same as that of the most recent write instruction (i2) to the same
address, or it should be the initial value.


RV: ∀ i1, ∃ i2, ∀ i3. IsRead(i1) ⇒
    (DataInit(i1) ∨ (IsWrite(i2) ∧
    SameAddr(i1,i2) ∧ hb(i2.DX, i1.DX)
    ∧ ValEq(i1,i2) ∧ ((IsWrite(i3) ∧
    SameAddr(i1,i3)) ⇒
    (hb(i3.DX, i2.DX) ∨ hb(i1.DX, i3.DX)))))


This is not a universal axiom, and hence Thm. 2 does not
apply. However, for bounded programs we can construct
|instrsOf(P)|^2 concrete automata (since there are two universally
quantified variables: i1, i3) as discussed in §VII-B2.
We convert the existential quantifier over i2 into a finite
disjunction over instrsOf(P). We perform BMC queries for
programs with |I| = |instrsOf(P)| = 4, 6, 8.

By keeping instructions symbolic, we effectively prove
correctness for all programs within our bound |I|. Fig. 11
shows the instruction bound |I|, the number of
axiom automata |AA|, BMC depth d, and proof runtime.
Though our theoretical results apply to universal axioms, this
shows how an axiom automata-based operationalization can
be applied to arbitrary axioms by bounding |instrsOf(P)|.


|I|   |AA|   BMC d   Time
4     16     12      3m10s
6     36     16      15m48s
8     64     20      1h58m

Fig. 11: Proof runtimes for the Read-Values axiom for different
instruction counts (|I|).


B. An OoO processor: tomasulo 


Our second design is an out-of-order processor (based on 
[32]) that implements Tomasulo’s algorithm. The processor 
has stages: F (fetch), D (dispatch), I (issue), E (execute), 
WB (writeback), and C (commit). We verify in-order-commit, 
program-order fetch, and pipeline order axioms for this pro- 
cessor. A BMC proof (with d = 20) takes ~2m. 

The axiom axDep given below is crucial for correct execution
in an OoO processor. It enforces that execute (E) stages
for consecutive instructions should be in program order if the
destination of the first instruction is the same as the source of
the second, i.e. dependent instructions are executed in order.


axDep: ∀ i1, i2. (i1 <_r i2 ∧ Cons(i1,i2) ∧ DepOn(i1,i2))
       ⇒ hb(i1.E, i2.E)


We add a program counter (pc) to instructions and define
Cons(i_1, i_2) ≡ pc(i_1) + 4 = pc(i_2) and DepOn(i_1, i_2) ≡
dest(i_1) = src1(i_2) ∨ dest(i_1) = src2(i_2).
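
The two predicates and the premise of axDep translate directly into code. The following sketch assumes a hypothetical `Instr` record whose `index` field stands for the program-order position on a single core. It only evaluates the premise of the axiom, i.e. which pairs are required to execute in order.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    index: int  # program-order position (assumed single core)
    pc: int
    dest: int   # destination register
    src1: int   # first source register
    src2: int   # second source register

def cons(i1, i2):
    """Cons(i1, i2): i2 sits at the next 4-byte program counter."""
    return i1.pc + 4 == i2.pc

def dep_on(i1, i2):
    """DepOn(i1, i2): i2 reads the register written by i1."""
    return i1.dest == i2.src1 or i1.dest == i2.src2

def must_execute_in_order(i1, i2):
    """Premise of axDep; when true, hb(i1.E, i2.E) is required."""
    return i1.index < i2.index and cons(i1, i2) and dep_on(i1, i2)

add = Instr(index=0, pc=0, dest=3, src1=1, src2=2)   # r3 = r1 + r2
sub = Instr(index=1, pc=4, dest=4, src1=3, src2=1)   # r4 = r3 - r1
add2 = Instr(index=2, pc=8, dest=5, src1=1, src2=2)  # r5 = r1 + r2
```

In this example only the pair (add, sub) is constrained: sub is consecutive to add and reads r3, whereas add2 is consecutive to sub but independent of it.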

As before, we compose the operational model M corresponding
to this axiom with the RTL design. We symbolically
constrain the processor to execute a sequence of symbolic
(add and sub) instructions and assert !bad. A BMC query
(d = 20) results in an assertion violation. We manually
identified the bug as being caused by the incorrect reset of
entries in the Register Alias Table (RAT) in the commit stage.
When committing instruction i_0, the entry RAT(dest(i_0)) is
reset, while some instruction i_1 with dest(i_0) = dest(i_1)
is issued in the same cycle. A third instruction i_2 with
src1(i_2) = dest(i_0) then reads the result of i_0 instead of
i_1, violating the axiom. We fix this bug and perform a BMC
proof (d = 20), which takes ~6m30s. This demonstrates how
our technique can be used to identify a bug, correct it, and
check the fixed design.


C. A memory controller: sdram_ctrl 


To demonstrate the versatility of our approach, we ex- 
periment with an SDRAM controller [33], which interfaces 
a processor host with an SDRAM device, with a ready- 
valid interface for read/write requests. All intricacies related 
to interfacing with the SDRAM are handled by maintaining 
appropriate control state in the controller. In the following, we 
once again convert axioms into an operational model by our 
technique, and compose the generated model with the design. 

First we verify pipeline-stage axioms for sdram_ctrl for
write (4-stages) and read (5-stages) operations executed by the
host. A PDR-based (unbounded) proof for the pipeline axioms
requires ~8m. Next we verify properties related to the SDRAM
refresh operation [37]. The controller ensures that the host-level
behaviour is not affected by refreshes by creating an
illusion of atomicity for writes and reads. This results in the
axiom that once a write or read operation is underway, no
refresh stage should execute before it is completed. We once
again prove this property with PDR, which takes ~1m30s.


IX. RELATED WORK 


There has been much work on developing axiomatic (declarative)
models for memory consistency in parallel systems, at
the ISA level [2], [38], [39], the microarchitectural level [12],
[16], [11], and the programming language level [20], [40],
[41], [42], [43]. There has also been work on constructing
equivalent operationalizations for these models, e.g.,
for Power [2], ARMv8 [10], RA [8], C++ [7], and TSO
[19], [9]. These constructions are accompanied by
hand-written or theorem-prover based proofs demonstrating
equivalence with the axiomatic model. Our work is
related to these; however, we enable automatic generation of
equivalent operational models from axiomatic ones, eliminating
most of the manual effort.

At an abstract level, we have been inspired by classic works
that have developed connections between logics and automata
[44], [45]. There is a large body of work on synthesis of
operational implementations as well as monitors from temporal
specifications (e.g. [46], [47], [48]), most commonly those
written in Linear Temporal Logic (LTL) [49] and its variants
(e.g. [50]). In this paper we perform a similar conversion
but for a very different logic: μspec specifies constraints over
partial orders while LTL does so over totally ordered traces.
Additionally, the elements over which constraints are enforced
are also different: μspec constrains orderings of a known set of
events, while LTL does so over traces with potentially differing
sets of events (atoms). These differences make a direct
comparison with the previously mentioned works ineffectual,
and have required us to develop novel concepts in this work.

In terms of the application to proving properties, the work
closest to ours is RTLCheck [13], which compiles constraints
from μspec to SystemVerilog assertions. These assertions
are checked on a per-program basis. On the other hand,
we demonstrate the ability to prove unbounded correctness.
Additionally, for axioms that are not generally operationalizable
(for unbounded programs), we demonstrate the ability to
generate an operational model for some a priori known bound
on the program size. In this case, we can verify correctness
for all programs of size up to that bound, as opposed to on a
per-program basis as RTLCheck does. RTL2μspec [51] aims
to perform the reverse conversion: from RTL to μspec axioms.


X. CONCLUSION 


In this paper we make strides towards enabling greater
interoperability between operational and axiomatic models,
both through theoretical results and case studies. We derive
μspecRE, a restricted subset of the μspec domain-specific
language for axiomatic modelling. We show that the generation
of an equivalent finite-state operational model is impossible
for general μspec axioms, though it is feasible for universal
axioms in μspecRE. From a practical standpoint, we develop
an approach based on axiom automata that enables us to
automatically generate such equivalent operational models for
universally quantified axioms in μspecRE (or for arbitrary
μspec axioms if equivalence up to a bound is sufficient).

The challenges we surmount for our conversion (discussed 
in §I) find parallels in manual operationalization works [7], 
and we believe that the above concepts can be extended to 
formalisms such as Cat [2]. Our practical evaluation illustrates 
the key impact of this work—its ability to enable users of 
axiomatic models to take advantage of the vast number of 
techniques that have been developed for operational models 
in the fields of formal verification and synthesis. 


XI. FUTURE WORK 


An interesting direction for future work is to enrich μspec
semantics (e.g., with quantitative operators) such that valid
executions are guaranteed to satisfy t-reordering boundedness.
In addition to allowing generation of finite-state operational
models, we believe that such axioms would also capture
processor executions more precisely.

While some aspects of executions are easier to specify
operationally, others (e.g., non-deterministic scheduling) are
better suited to axiomatic specifications. Another direction for
future work is combining operational and axiomatic modelling,
for example using tools such as UCLID5 [52], [53].


ACKNOWLEDGMENTS 


This work was supported in part by Intel under the Scalable 
Assurance program and by DARPA contract FA8750-20-C- 
0156. 


REFERENCES


[1] Christel Baier and Joost-Pieter Katoen. Principles of model checking.
2008.
[2] Jade Alglave, Luc Maranget, and Michael Tautschnig. Herding Cats:
Modelling, Simulation, Testing, and Data Mining for Weak Memory.
ACM Trans. Program. Lang. Syst., 36(2), July 2014.
[3] Yatin A. Manerkar. Progressive Automated Formal Verification of
Memory Consistency in Parallel Processors. PhD thesis, Princeton
University, Princeton, NJ, USA, 2020.
[4] Jerry R. Burch and David L. Dill. Automatic verification of pipelined
microprocessor control. In CAV, 1994.
[5] Aaron R. Bradley. SAT-based model checking without unrolling. In
VMCAI, 2011.
[6] Niklas Eén, Alan Mishchenko, and Robert K. Brayton. Efficient
implementation of property directed reachability. 2011 Formal Methods
in Computer-Aided Design (FMCAD), pages 125-134, 2011.
[7] Kyndylan Nienhuis, Kayvan Memarian, and Peter Sewell. An operational
semantics for C/C++11 concurrency. In OOPSLA, 2016.
[8] Ori Lahav, Nick Giannarakis, and Viktor Vafeiadis. Taming
release-acquire consistency. Proceedings of the 43rd Annual ACM
SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016.
[9] Scott Owens, Susmit Sarkar, and Peter Sewell. A better x86 memory
model: x86-TSO. In TPHOLs, 2009.
[10] Christopher Pulte, Shaked Flur, Will Deacon, Jon French, Susmit Sarkar,
and Peter Sewell. Simplifying ARM concurrency: multicopy-atomic
axiomatic and operational models for ARMv8. Proceedings of the ACM
on Programming Languages, 2:1-29, 2018.
[11] Daniel Lustig, Geet Sethi, Margaret Martonosi, and Abhishek
Bhattacharjee. COATCheck: Verifying Memory Ordering at the
Hardware-OS Interface. In 21st International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), 2016.
[12] Daniel Lustig, Michael Pellauer, and Margaret Martonosi. PipeCheck:
Specifying and verifying microarchitectural enforcement of memory
consistency models. 2014 47th Annual IEEE/ACM International
Symposium on Microarchitecture, pages 635-646, 2014.
[13] Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Michael
Pellauer. RTLCheck: Verifying the memory consistency of RTL designs.
2017 50th Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO), pages 463-476, 2017.
[14] Yatin A. Manerkar, Daniel Lustig, Margaret Martonosi, and Aarti Gupta.
PipeProof: Automated Memory Consistency Proofs for
Microarchitectural Specifications. 2018 51st Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pages 788-801, 2018.
[15] Caroline Trippel, Daniel Lustig, and Margaret Martonosi. CheckMate:
Automated exploit program generation for hardware security
verification. 2018.
[16] Yatin A. Manerkar, Daniel Lustig, Michael Pellauer, and Margaret
Martonosi. CCICheck: Using μhb graphs to verify the
coherence-consistency interface. 2015 48th Annual IEEE/ACM International
Symposium on Microarchitecture (MICRO), pages 26-37, 2015.
[17] Adwait Godbole, Yatin A. Manerkar, and Sanjit A. Seshia. Automated
Conversion of Axiomatic to Operational Models: Theory and Practice.
https://arxiv.org/abs/2208.06733, 2022.
[18] Jeremy Manson. The Java memory model. In POPL '05, 2005.
[19] Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli,
and Magnus O. Myreen. x86-TSO: A Rigorous and Usable
Programmer's Model for x86 Multiprocessors. Communications of the ACM,
53:89-97, 2010.
[20] Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber.
Mathematizing C++ concurrency. In POPL '11, 2011.
[21] Mustaque Ahamad, Gil Neiger, James E. Burns, Prince Kohli, and
Phillip W. Hutto. Causal memory: definitions, implementation, and
programming. Distributed Computing, 9:37-49, 2005.
[22] Evgenii Moiseenko, Anton Podkopaev, Ori Lahav, Orestis Melkonian,
and Viktor Vafeiadis. Reconciling event structures with modern
multiprocessors. ArXiv, abs/1911.06567, 2020.
[23] Alan Jeffrey and James Riely. On thin air reads towards an event
structures model of relaxed memory. 2016 31st Annual ACM/IEEE
Symposium on Logic in Computer Science (LICS), pages 1-9, 2016.
[24] Brian Norris and Brian Demsky. CDSchecker: checking concurrent data
structures written with C/C++ atomics. Proceedings of the 2013 ACM
SIGPLAN international conference on Object oriented programming
systems languages & applications, 2013.


[25] Stavros Aronis.
2018.
[26] Michalis Kokologiannakis and Viktor Vafeiadis. HMC: Model checking
for hardware memory models. Proceedings of the Twenty-Fifth
International Conference on Architectural Support for Programming
Languages and Operating Systems, 2020.
[27] Jean Berstel. Transductions and context-free languages.
Studienbücher: Informatik, 1979.
[28] Jacques Sakarovitch. Elements of automata theory. 2009.
[29] Luc Maranget, Jade Alglave, Susmit Sarkar, and Peter Sewell. Litmus:
Running Tests against Hardware. In TACAS'11, 17th International
Conference on Tools And Algorithms for the Construction and Analysis
of Systems, Saarbrücken, Germany, March 2011.
[30] Yatin A. Manerkar. multi-vscale.
https://github.com/ymanerka/multi_vscale/tree/multicore.
[31] LGTMCU. vscale. [Online; accessed 11-05-2021].
[32] Soham-Das-2021. Tomasulo.
https://github.com/Soham-Das-2021/Tomasulo-Machine.
[Online; accessed 11-05-2021].
[33] Stafford Horne. SDRAM controller.
https://github.com/stffrdhrn/sdram-controller.
[Online; accessed 11-05-2021].
[34] Clifford Wolf, Johann Glaser, and Johannes Kepler.
Verilog Synthesis Suite. 2013.
[35] Aina Niemetz, Mathias Preiner, and Armin Biere.
J. Satisf. Boolean Model. Comput., 9(1):53-58, 2014.
[36] Berkeley Logic Synthesis and Verification Group. ABC: A system for
sequential synthesis and verification, release 70930.
http://www.eecs.berkeley.edu/~alanmi/abc/.
[37] Bruce Jacob, Spencer W. Ng, and David T. Wang. Memory systems:
Cache, DRAM, disk. 2007.
[38] Dennis Shasha and Marc Snir. Efficient and correct execution of parallel
programs that share memory. ACM Trans. Program. Lang. Syst.,
10:282-312, 1988.
[39] RISC-V Foundation. The RISC-V Instruction Set Manual, Volume I:
User-Level ISA, Document Version 2.2.
[40] Mark Batty, Alastair F. Donaldson, and John Wickerson. Overhauling
SC atomics in C11 and OpenCL. Proceedings of the 43rd Annual
ACM SIGPLAN-SIGACT Symposium on Principles of Programming
Languages, 2015.
[41] Viktor Vafeiadis, Thibaut Balabonski, Soham Sundar Chakraborty,
Robin Morisset, and Francesco Zappa Nardelli. Common compiler
optimisations are invalid in the C11 memory model and what we can
do about it. Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, 2015.
[42] Conrad Watt, Christopher Pulte, Anton Podkopaev, G. Barbier, Stephen
Dolan, Shaked Flur, Jean Pichon-Pharabod, and Shu-yu Guo. Repairing
and mechanising the JavaScript relaxed memory model. Proceedings of
the 41st ACM SIGPLAN Conference on Programming Language Design
and Implementation, 2020.
[43] Ori Lahav, Viktor Vafeiadis, Jeehoon Kang, Chung-Kil Hur, and Derek
Dreyer. Repairing sequential consistency in C/C++11. Proceedings of
the 38th ACM SIGPLAN Conference on Programming Language Design
and Implementation, 2017.
[44] J. Richard Büchi. On a decision method in restricted second order
arithmetic. 1990.

Pierre Wolper, Moshe Y. Vardi, and A. Prasad Sistla. Reasoning about 
infinite computation paths. In 24th Annual Symposium on Foundations 
of Computer Science (sfcs 1983), pages 185-194, 1983. 

Bernd Finkbeiner and Sven Schewe. Bounded synthesis. International 
Journal on Software Tools for Technology Transfer, 15:519-539, 2012. 
Klaus Havelund and Grigore Rosu. Synthesizing monitors for safety 
properties. In TACAS, 2002. 

Klaus Havelund and Grigore Rosu. Efficient monitoring of safety 
properties. International Journal on Software Tools for Technology 
Transfer, 6:158-173, 2003. 

Zohar Manna and Amir Pnueli. The temporal logic of reactive and 
concurrent systems. In Springer New York, 1992. 

Vasumathi Raman, Alexandre Donzé, Dorsa Sadigh, Richard M. Murray, 
and Sanjit A. Seshia. Reactive synthesis from signal temporal logic 
specifications. Proceedings of the 18th International Conference on 
Hybrid Systems: Computation and Control, 2015. 

Yao Hsiao, Dominic P. Mulligan, Nikos Nikoleris, Gustavo Petri, and 
Caroline Trippel. Synthesizing formal models of hardware from RTL 
for efficient verification of memory model implementations. MICRO-54: 


Effective techniques for Stateless Model Checking. 


In Teubner 


https://github.com/LGTMCU/vscale.  [Online; 


Yosys-A Free 


Boolector 2.0. J. 


[52] 


[53] 


54th Annual IEEE/ACM International Symposium on Microarchitecture, 
2021. 

Sanjit A. Seshia and Pramod Subramanyan. UCLIDS5: Integrating 
modeling, verification, synthesis and learning. 2018 16th ACM/IEEE 
International Conference on Formal Methods and Models for System 
Design (MEMOCODE), pages 1-10, 2018. 

Elizabeth Polgreen, Kevin Cheang, Pranav Gaddamadugu, Adwait God- 
bole, Kevin Laeufer, Shaokai Lin, Yatin A. Manerkar, Federico Mora, 
and Sanjit A. Seshia. UCLIDS: Multi-modal formal modeling, veri- 
fication, and synthesis. In Sharon Shoham and Yakir Vizel, editors, 
Computer Aided Verification, pages 538-551, Cham, 2022. Springer 
International Publishing. 


342 


Formal Methods in Computer-Aided Design 2022


Formally Verified Quite OK Image Format 


Mario Bucev 
School of Computer and Communication Sciences 
EPFL 
1015 Lausanne, Switzerland 
mario.bucev@epfl.ch


Abstract—Lossless compression and decompression functions 
are ubiquitous operations that have a clear high-level specifi- 
cation and are thus suitable as verification benchmarks. Such 
functions are also important. On the one hand, they improve the 
performance of communication, storage, and computation. On 
the other hand, errors in them would result in a loss of data. 
These functions operate on sequences of unbounded length and
contain unbounded loops or recursion that update a large state
space, which makes finite-state methods and symbolic execution
difficult to apply.

We present deductive verification of an executable Stainless im- 
plementation of compression and decompression for the recently 
proposed Quite OK Image format (QOI). While fast and easy to 
implement, QOI is non-trivial and includes a number of widely 
used techniques such as run-length encoding and dictionary- 
based compression. We completed formal verification using the 
Stainless verifier, proving that encoding followed by decoding 
produces the original image. The Stainless transpiler was also able
to generate C code that compiles with GCC, is inter-operable 
with the reference implementation and runs with performance 
essentially matching the reference C implementation. 

Index Terms—formal verification, compression, Stainless, SMT 
solver, mechanized induction 


I. INTRODUCTION 


Lossless conversions are ubiquitous. Examples include com- 
pression tools such as zip, as well as lossless image formats 
such as PNG. Unfortunately, common compression formats, 
especially ones for pictures, are more complex than one would 
expect at first. As a result of this complexity and the absence
of precise specifications, it has proven difficult to reason 
about implementations of these algorithms. Consequently, 
the practice in the field is to use software testing, possibly 
backed by advanced testing algorithms [1], which do not 
guarantee correctness. As a reaction to the complexity of 
existing formats, Dominic Szablewski announced the “quite 
OK image format” [2] on 24 November 2021. The proposal 
was accompanied by a concise and efficient implementation. It 
attracted significant attention, with re-implementations quickly 
emerging in different programming languages (including Ver- 
ilog) as well as variations such as streaming implementations. 

Inspired by these developments, this paper presents an exe- 
cutable and formally verified implementation of the quite OK 
image encoding and decoding algorithms. We presented
this formal development and shared the code on GitHub as
part of the ASPLOS 2022 tutorial at EPFL in March 2022
[3], but no reviewed record of the work existed until now.
The verified case study is now also available at: 


https://github.com/epfl-lara/bolts/tree/master/qoi/ 


We are not aware of any other implementation of QOI whose
functional correctness has been formally verified. A recent blog
post refers to an implementation in Ada/SPARK. Our understanding
is that this Ada/SPARK implementation only proves the absence of
run-time errors and not full correctness.

In a broader line of work, formal verification has been applied
either to specific algorithms or to domain-specific languages.
The Deflate algorithm specification [4] has been formalized,
implemented, and verified in Coq [5]. Researchers have also
formalized common lemmas of information theory in Coq and
applied them to Shannon-Fano codes [6].

Related approaches verify serialization tasks, which do not
typically aim to compress data. Examples of such work include a
formally verified Protocol Buffers compiler implemented in Coq
for a commonly used subset of this serialization format [7].
Correct-by-construction pretty printing in parsing libraries
also ensures correctness subject to certain local invertibility
conditions [8, Section 6.4], as do invertible lenses [9]. Our
case study may thus also provide a starting point for exploring
the expressive power of provably invertible domain-specific
languages for data transformation.


II. BACKGROUND 
A. Stainless Verifier and C Transpiler 


Stainless [3], [10]-[12] accepts as input source code in 
a subset of the Scala programming language [13]. Typical 
Stainless programs can thus be compiled using the existing 
Scala compilers and run using the Java Virtual Machine. 

Stainless supports formal verification of assertions, precon- 
ditions, postconditions, and invariants using the Inox solver. 
Inox in turn relies on unfolding of function definitions and 
uses SMT solvers, notably Z3, CVC4, and Princess. 

Stainless also supports generation of C code (transpilation) 
for a subset of Scala. This subset targets programs without 
heap-allocated memory, in the spirit of our previous case 
study [14]. We wrote our QOI format case study to meet the
expectations of the C code generator; it is the generated C code
that we use for the performance comparison (Section IV-C). 


B. QOI Format Overview 


To encourage subsequent verification efforts and compar- 
isons, we summarize here the QOI format definition. The 
format is structured with a header, followed by the actual data, 
and terminated by a marker (7 zero bytes followed by 01₁₆).
Table I describes the header format. Images are encoded in a 
row-major order (left-to-right, top-to-bottom). 
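
As a rough illustration of the layout in Table I, the following Scala sketch writes the 14 header bytes and the end marker. The names QoiHeaderSketch, header, and endMarker are hypothetical and not taken from the case study; only the field order, the field sizes, and the big-endian encoding come from the description above.

```scala
import java.nio.ByteBuffer

object QoiHeaderSketch {
  // End marker as described in the text: seven zero bytes followed by 01 (hex).
  val endMarker: Array[Byte] = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)

  // Lays out the 14 header bytes in the order given by Table I.
  // ByteBuffer is big-endian by default, which matches the w and h fields.
  def header(w: Int, h: Int, chan: Byte, colorSpace: Byte): Array[Byte] = {
    val buf = ByteBuffer.allocate(14)
    buf.put("qoif".getBytes("US-ASCII")) // magic, offset 0, 4 bytes
    buf.putInt(w)                        // width, offset 4, 4 bytes
    buf.putInt(h)                        // height, offset 8, 4 bytes
    buf.put(chan)                        // channels, offset 12, 1 byte
    buf.put(colorSpace)                  // color space, offset 13, 1 byte
    buf.array()
  }
}
```

For instance, QoiHeaderSketch.header(3, 2, 3, 0) would describe a 3 × 2 RGB image in the sRGB color space.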

The QOI encoder is single-pass. It manipulates the following
data structures:

• The image to encode, pixels. Each pixel consists of
  chan bytes.
• The current index pxPos within pixels (a multiple of chan),
  the current pixel px, as well as the previous pixel pxPrev
  (initialized to R = G = B = 0 and A = 255).
• The encoded image bytes and the output position outPos
  within bytes.
• index, an array of 64 pixels denoting previously-seen
  pixels. It is zero-initialized.
• run, counting the number of equal consecutive pixels
  (initialized to 0).

In the following, we write px.r, px.g, px.b, px.a to refer to 
the red, green, blue, and alpha channels of a pixel px. When a 
pixel does not have an alpha channel, we default px.a to 255. 

Each pixel is encoded in one of four different cases, two of 
which have two subcases. Encoded pixels are written in tagged 
chunks, uniquely identifying the applied (sub)case. The details 
of the chunk formats and computations can be found in [2]. 

Case A. If px = pxPrev, we increment the run counter. 
Whenever it reaches 62, we write a run chunk, reset run to 
0 and continue with the next pixel. 

Otherwise, if px ≠ pxPrev and run > 0, we write a run
chunk as well, reset run to 0 and proceed to encode px using 
the remaining three methods. 

Case B. We compute a hash of the current pixel px, denoted 
by colorPos(px). The hash function is set by the QOI standard 
and yields a non-negative number smaller than 64. Then, if 
index(colorPos(px)) = px, we write an index chunk using the 
computed position and proceed with the next pixel. Otherwise, 
we update index(colorPos(px)) with px and encode px using the 
two remaining methods. 
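
The hash function is not spelled out in this overview. As a sketch, the formula below is the one given in the QOI specification [2]; it is an assumption with respect to the case study, which stores a pixel as a single Int, whereas this illustration takes the four channels separately. The name QoiHashSketch is hypothetical.

```scala
object QoiHashSketch {
  // Channel values are expected in 0..255, so the result lies in 0..63.
  // Weights 3, 5, 7, 11 follow the QOI specification (assumed here).
  def colorPos(r: Int, g: Int, b: Int, a: Int): Int =
    (r * 3 + g * 5 + b * 7 + a * 11) % 64
}
```

With this formula, the initial pxPrev (black and opaque) hashes to QoiHashSketch.colorPos(0, 0, 0, 255), which is 53.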

TABLE I
QOI FILE HEADER STRUCTURE. OFFSET AND SIZE ARE GIVEN IN BYTES.

Name        | Offset | Size | Description
Magic       | 0      | 4    | qoif to indicate a QOI image
w           | 4      | 4    | Image width in pixels (in big-endian)
h           | 8      | 4    | Image height in pixels (in big-endian)
chan        | 12     | 1    | Channels: 3 for RGB; 4 for RGBA
Color space | 13     | 1    | 0: sRGB with linear alpha, 1: all channels linear (informative)

[Figure: header bytes (magic number, w = 3, chan = 3), payload chunks
(run of 3, luma, RGB, index), end marker (00 00 00 00 00 00 00 01),
and the resulting image.]

Fig. 1. Example of a Compressed Image in QOI format

Cases C.i and C.ii. The idea is to encode a difference
between the current and previous pixel, provided the difference
is “small enough”. This case comes with two variants: the diff
subcase (C.i) with a chunk size of 1 byte and the luma subcase
(C.ii) for larger magnitudes with a chunk size of 2 bytes.
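
To make “small enough” concrete, the sketch below chooses between the two subcases. The tags, ranges, and biases are taken from the QOI specification [2] rather than from the case study, and the names (QoiDiffSketch, encodeDiffOrLuma) are hypothetical. It assumes that the alpha channel is unchanged, and it treats channel values as 0..255 integers with wrap-around differences.

```scala
object QoiDiffSketch {
  // Wrapping difference of two channel values (0..255), read as a signed byte.
  private def delta(cur: Int, prev: Int): Int = (cur - prev).toByte.toInt

  private def within(x: Int, lo: Int, hi: Int): Boolean = lo <= x && x <= hi

  // Returns the chunk bytes (as 0..255 values) for case C.i or C.ii,
  // or None when the difference is too large for both subcases.
  def encodeDiffOrLuma(prev: (Int, Int, Int), cur: (Int, Int, Int)): Option[Seq[Int]] = {
    val dr = delta(cur._1, prev._1)
    val dg = delta(cur._2, prev._2)
    val db = delta(cur._3, prev._3)
    val drDg = dr - dg
    val dbDg = db - dg
    if (within(dr, -2, 1) && within(dg, -2, 1) && within(db, -2, 1))
      // diff chunk: tag 01, then three 2-bit differences stored with a bias of 2
      Some(Seq(0x40 | ((dr + 2) << 4) | ((dg + 2) << 2) | (db + 2)))
    else if (within(dg, -32, 31) && within(drDg, -8, 7) && within(dbDg, -8, 7))
      // luma chunk: tag 10 and dg with a bias of 32,
      // then dr - dg and db - dg, each with a bias of 8
      Some(Seq(0x80 | (dg + 32), ((drDg + 8) << 4) | (dbDg + 8)))
    else None
  }
}
```

Under these assumptions, going from a black pixel (0, 0, 0) to (0, 250, 250) does not fit the diff subcase (the green difference wraps to -6) but does fit the luma subcase.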

Cases D.i and D.ii. When none of the above cases applies,
we resort to encoding the full RGB value if px.a = pxPrev.a
(D.i) or the full RGBA value otherwise (D.ii).

Decompression is single-pass as well and maintains the 
same data structures as the compression counterpart. The 
decoder iterates over all chunks and applies the reverse trans- 
formation. 

Example of decoding an image. Consider the encoded QOI
image depicted in Fig. 1. Squares denote bytes in hexadecimal
while thick black boxes delimit the chunks. Though this figure
actually transcribes the shown 3 × 2 image in the QOI format,
knowing the exact details of the computations is unnecessary
for this discussion.

The decoder starts with a black and opaque pxPrev. It reads
the first data byte (C2₁₆) and uniquely identifies a run chunk
indicating to repeat the previous pixel pxPrev 3 times (case A).
The decoder then proceeds with the next chunk.

The next byte, 9A₁₆, signals that it and the following byte,
E8₁₆, constitute a luma chunk (case C.ii). The decoder
computes a cyan² pixel based on the previous pixel and the
differences stored in this chunk. Before moving on, this pixel
is stored in index at the position given by colorPos(·).

Next, FE₁₆ identifies an RGB chunk (case D.i), followed by
three identical bytes D2₁₆, producing a light gray pixel.
The decoder computes a position for this pixel and stores it
in index (which happens not to collide with the previous cyan
pixel).

Finally, 2D₁₆ specifies an index chunk (case B) with the
position of the cyan pixel decoded previously.
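
The walk-through above can be replayed with a small, unverified decoder sketch. The chunk layouts and the hash are assumed from the QOI specification [2], the names (QoiDecodeSketch, Px, decode) are hypothetical, and the payload is taken to be well formed, with the header and end marker already stripped. This is an illustration only and not the verified decoder of the case study.

```scala
object QoiDecodeSketch {
  final case class Px(r: Int, g: Int, b: Int, a: Int)

  // Hash as given in the QOI specification (assumed, see lead-in).
  def colorPos(p: Px): Int = (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) % 64

  // Decodes payload bytes given as 0..255 values; performs no bounds checks.
  def decode(payload: Vector[Int]): Vector[Px] = {
    val index = Array.fill(64)(Px(0, 0, 0, 0)) // zero-initialized
    var prev = Px(0, 0, 0, 255)                // black and opaque
    var out = Vector.empty[Px]
    var i = 0
    while (i < payload.length) {
      val byte0 = payload(i)
      val tag = byte0 >> 6
      var count = 1
      if (byte0 == 0xFE) {        // RGB chunk (case D.i)
        prev = Px(payload(i + 1), payload(i + 2), payload(i + 3), prev.a)
        i += 4
      } else if (byte0 == 0xFF) { // RGBA chunk (case D.ii)
        prev = Px(payload(i + 1), payload(i + 2), payload(i + 3), payload(i + 4))
        i += 5
      } else if (tag == 0) {      // index chunk (case B)
        prev = index(byte0 & 0x3F)
        i += 1
      } else if (tag == 1) {      // diff chunk (case C.i), bias 2
        prev = Px((prev.r + ((byte0 >> 4) & 3) - 2) & 0xFF,
                  (prev.g + ((byte0 >> 2) & 3) - 2) & 0xFF,
                  (prev.b + (byte0 & 3) - 2) & 0xFF, prev.a)
        i += 1
      } else if (tag == 2) {      // luma chunk (case C.ii), biases 32 and 8
        val dg = (byte0 & 0x3F) - 32
        val byte1 = payload(i + 1)
        prev = Px((prev.r + dg + (byte1 >> 4) - 8) & 0xFF,
                  (prev.g + dg) & 0xFF,
                  (prev.b + dg + (byte1 & 0xF) - 8) & 0xFF, prev.a)
        i += 2
      } else {                    // run chunk (case A), length stored minus one
        count = (byte0 & 0x3F) + 1
        i += 1
      }
      index(colorPos(prev)) = prev
      out = out ++ Vector.fill(count)(prev)
    }
    out
  }
}
```

Feeding it the payload bytes named in the text (C2, 9A E8, FE D2 D2 D2, 2D) yields six pixels: three black ones, a cyan one (0, 250, 250), a light gray one (210, 210, 210), and the cyan one again, which matches a 3 × 2 image.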


III. VERIFICATION APPROACH 


We proved two classes of properties (memory safety is
ensured by the programming language model):

• Runtime safety: for any input, the encoder and decoder
  do not access arrays out of bounds or throw exceptions.
• Correctness: decoding is the inverse of encoding
  (invertibility).

It is much less work to show only the first property, so we
focus our presentation on the second one.


²Dark gray in monochromatic.

To prove correctness, we proceed by “running” the encoder
on an arbitrary but fixed input and decode the image at the 
same time as it is encoded. Once we are finished, the decoded 
image must be the same as the original one. 

We establish not only separate invariants for the encoder and 
decoder’s respective states, but also an invariant that ties them. 
For example, if the encoder encounters a sequence of repeating 
pixels (case A), it delays writing down the chunk until the end 
of this sequence. In such a case, the decoder is expected to 
lag behind the encoder. On the other hand, for cases B, C and 
D, both the encoder and decoder are expected to advance at 
the same pace and are, in some sense, synchronized. 

Then, given encoder and decoder states satisfying the
invariants, we show that encoding a single pixel and decoding it
gives back the same pixel while maintaining these invariants.
We then generalize this result to the entire image by
induction.

To describe invertibility in Stainless, we write plain Scala 
code in terms of encode and decode, and provide the appro- 
priate conditions. Before presenting the inversion theorem, we 
deem it helpful first to introduce some definitions. 

The following snippet contains the declarations of three 
records (or case classes in Scala’s terminology). For concise- 
ness, we abbreviate a: T, b: T, c: T to a, b, c: T below. 


// Encoding context
case class EncCtx(pixels: Array[Byte], w, h, chan: Long) {
  // invariants on the fields (only one conjunct shown)
  require(pixels.length == w * h * chan)
}

case class EncodedResult(encoded: Array[Byte], length: Long)
case class DecodedResult(pixels: Array[Byte], w, h, chan: Long)

EncCtx contains the input of the encoder: the image (pixels, 
an array of RGBA bytes) as well as its dimensions and the 
number of channels. As these values may not be arbitrary 
(for instance, we must have pixels.length == w * h * chan), we add
a require clause that specifies an invariant over these fields. 
Stainless then injects these assumptions into proofs when the 
values of the type appear in verification conditions. 
EncodedResult, as its name suggests, holds the result of the 
encoding process. As encoded must be big enough to account 
for the worst case, the length field indicates the effective size 
of the compressed image. 
We can now state the “invertibility theorem” with the
decodeEncodeIsIdentityThm function in the snippet below³.

def encode(ctx: EncCtx): EncodedResult = ...

def decode(bytes: Array[Byte], /* exclusive end index for decoding: */
           until: Long): Option[DecodedResult] = ...

def decodeEncodeIsIdentityThm(ctx: EncCtx): Boolean = {
  val res = encode(ctx)
  decode(res.encoded, res.length) match
    case Some(DecodedResult(decoded, w, h, chan)) =>
      w == ctx.w && h == ctx.h && chan == ctx.chan &&
      // Predicate for comparing arrays within a range
      arraysEq(ctx.pixels, decoded, 0, ctx.pixels.length)
    case None() => false // i.e. should be unreachable
}.holds


³For brevity of presentation, code and specification snippets may slightly
differ from the actual case study available at the URL shown in the
introduction.


The .holds construct in decodeEncodeIsIdentityThm asks Stainless
to prove the following. Given a valid EncCtx (representing the
encoder input) that satisfies its stated invariant, feeding the
result res of the encoder to the decoder always succeeds: the
case None() branch returns false, so the theorem can only hold
if that branch is unreachable. Additionally, the decoded
dimensions and number of channels correspond to the original
input. Furthermore, the original and decoded images are equal.
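
To illustrate the shape of this statement outside of Stainless, the sketch below states the same round-trip property for a toy run-length codec. It is plain Scala and only checks the property at run time on the inputs it is given, whereas Stainless proves it for all inputs. The codec and all names (RoundTripSketch, decodeEncodeIsIdentity) are hypothetical simplifications and not part of the case study.

```scala
object RoundTripSketch {
  // Toy codec: each run of equal values becomes a (value, runLength) pair,
  // with runs capped at 62 in the spirit of case A.
  def encode(xs: List[Int]): List[(Int, Int)] = xs match {
    case Nil => Nil
    case x :: _ =>
      val run = xs.takeWhile(_ == x).take(62).length
      (x, run) :: encode(xs.drop(run))
  }

  def decode(chunks: List[(Int, Int)]): List[Int] =
    chunks.flatMap { case (x, run) => List.fill(run)(x) }

  // Executable counterpart of the theorem statement:
  // true when decoding the encoded input gives back the input.
  def decodeEncodeIsIdentity(xs: List[Int]): Boolean =
    decode(encode(xs)) == xs
}
```

A test suite can call decodeEncodeIsIdentity on sample inputs, but such checks cover only the inputs tried, which is the gap that the deductive proof closes.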

To help Stainless prove this theorem, we must establish
contracts for several functions, provide sufficient proof
annotations to guide the solver, and write lemmas, which are
just (possibly recursive) functions stating a property. However,
decodeEncodeIsIdentityThm does not contain any proof annotation,
as everything needed to derive the conclusion is contained
in the definitions of encode and decode.

In fact, encode and decode contain few annotations. They 
delegate the work (alongside the proofs) to encodeLoop and 
decodeLoop. In particular, encodeLoop iterates (through recur- 
sion) over the pixels and invokes encodeSingleStep for the actual 
work. By stating a sufficiently strong induction hypothesis (IH) 
on encodeLoop and combining the IH with the properties of 
encodeSingleStep, we obtain a proof of invertibility.
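
As a minimal sketch of what iterating “through recursion” looks like, the following plain Scala function visits an image pixel by pixel with a tail-recursive helper instead of a while loop. The function and its task (counting pixels equal to their predecessor) are hypothetical and chosen only for illustration; the actual encodeLoop additionally carries contracts and the encoder state.

```scala
import scala.annotation.tailrec

object LoopSketch {
  // Counts the pixels that are equal to their predecessor.
  // pxPos advances by chan bytes per step, as in the description above.
  def countRepeats(pixels: Array[Byte], chan: Int): Int = {
    @tailrec
    def loop(pxPos: Int, acc: Int): Int =
      if (pxPos + chan > pixels.length) acc
      else {
        val same = (0 until chan).forall(k => pixels(pxPos + k) == pixels(pxPos - chan + k))
        loop(pxPos + chan, if (same) acc + 1 else acc)
      }
    if (chan <= 0) 0 else loop(chan, 0) // start at the second pixel
  }
}
```

The recursive call is the last action of loop, which is what allows a compiler to turn it into a jump; the @tailrec annotation makes the Scala compiler reject the code if that is not the case.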

As encodeLoop is “just” gluing the pieces together, we 
instead present encodeSingleStep: 


// Pixel read from the pixels array, updated output
// position within the bytes array and updated run.
case class EncodingIteration(px: Int, outPos, run: Long)

// Contains the state of the decoder, that is mutated
// in encodeSingleStep (`var` marks a field as mutable).
case class GhostDecoded(var index: Array[Int],
  var pixels: Array[Byte], var inPos, var pxPos: Long)

def encodeSingleStep(index: Array[Int], bytes: Array[Byte],
  pxPrev: Int, run0, outPos0, pxPos: Long, ctx: EncCtx,
  @ghost decoded: GhostDecoded): EncodingIteration = // ...


encodeSingleStep returns an EncodingIteration that gives the last
read pixel (px) and the one-past-the-end position of the last written
byte (outPos). For a sequence of repeating pixels, the run field
of the returned record is incremented. Otherwise, the encoded
pixels are written (in-place) in bytes and outPos is updated
accordingly.

Notably, encodeSingleStep takes a ghost parameter, decoded, 
which models the decoder state that would arise during 
possible future decoding runs. Ghost variables are subject 
to ghost elimination, which we discuss in more detail in
Section IV-C. Intuitively, ghost variables allow tracking some extra
information that may only be used for contracts and proof 
annotations: in particular, they cannot influence the execution 
of the algorithm [15]. 

The precondition of encodeSingleStep requires that the de- 
coder state is consistent: for instance, the currently decoded 
pixels correspond to the original ones. At the end of the 
function, before returning, we “run” the decoder on decoded 
by calling decodeLoop with the updated index and bytes arrays. 

Then, we can express local invertibility as follows. If we
run the decoder from the old decoded state (i.e. before entering
encodeSingleStep) on the bytes we wrote when executing
encodeSingleStep, then the decoded pixels must correspond to
the pixels that have been encoded.

To prove this key property, we proceed in two phases, akin 
to how the encoder proceeds. The snippet below shows an 
excerpt of encodeSingleStep, highlighting these two phases.


// Record returned by updateRun
case class RunUpdate(reset: Boolean, run, outPos: Long)

def encodeSingleStep(...) = {
  // ... Some preconditions
  // A copy of the "original" index, will be erased by ghost elimination:
  @ghost val oldIndex = freshCopy(index)
  // Phase 1: Run-length processing (case A)
  val runUpd = updateRun(bytes, run0, outPos0)
  val run1 = runUpd.run
  val outPos1 = runUpd.outPos
  // The premise holds when flushing (writing down the run chunk)
  assert(runUpd.reset ==>
    updateRunProp(pxPrev, px, bytes, run0, outPos0, outPos1))
  // ... other assertions
  // Phase 2: Encode pixel individually (cases B, C, D)
  val outPos2 = if px != pxPrev then
    val outPos2 = encodeNoRun(index, bytes, outPos1)
    // ... some assertions and lemmas to support this claim
    assert(encodeNoRunProp(pxPrev, px, oldIndex, index, bytes,
      outPos1, outPos2))
    outPos2
  else
    // ... assertions stating invariants are preserved
    outPos1
  // ... assertions to glue everything together
  EncodingIteration(px, outPos2, run1)
}.ensuring(/* postconditions stating distilled properties */)


First, the encoder handles the run-length part of the algorithm,
corresponding to case A as described in Section II-B. The work
is delegated to updateRun, which returns a record telling (through
the reset field) whether a run chunk was written to bytes. If
not, then invertibility is of course preserved as the encoded
pixels are left untouched. Otherwise, updateRun guarantees
that reading the written chunk gives us a run chunk whose
value is the run counter we have just written, as expressed by
updateRunProp, presented afterward.

Second, in the case where the previous and current pixels
are different, the encoder picks method B, C or D to encode
the current pixel. The task is handed over to encodeNoRun, and
encodeNoRunProp states that reading the written chunk
yields back the pixel.

updateRunProp and encodeNoRunProp both use doDecodeNext
to decode the written chunk. The latter returns an ADT with
two variants describing the decoded chunk. Run(r) indicates a
run chunk with r + 1 repeating pixels. The +1 is a result
of the run counter being shifted by one when encoded.
DiffOrIndexOrColor(px) denotes a pixel encoded by method B, C
or D. Due to the variable length nature of chunks, doDecodeNext
also returns the position of the next chunk to be decoded (if
any).


enum DecodedNext:
  case Run(run: Long)
  case DiffOrIndexOrColor(px: Int)

def doDecodeNext(bytes: Array[Byte], index: Array[Int],
  pxPrev: Int, inPos0: Long): (DecodedNext, Long) = ...


Expressing the desired properties is then a matter of pattern- 
matching over the result of doDecodeNext and tying it with 
appropriate equalities. 

def updateRunProp(pxPrev, px: Int, bytes: Array[Byte],
  run0, outPos0, outPos1: Long): Boolean =
  // ... preconditions including e.g. ordering on outPos0, outPos1
  // If px == pxPrev, the current run counter run0 is incremented
  // (reflected by the conditional +1).
  val run = run0 + bool2int(px == pxPrev)
  // The index does not matter for this case, we give an arbitrary array
  val dummyIndex = Array.fill(64)(0)
  doDecodeNext(bytes, dummyIndex, pxPrev, outPos0) match
    case (Run(r), inPos) => r + 1 == run && inPos == outPos1
    case _ => false

// oldIndex refers to the index at the beginning of encodeSingleStep
def encodeNoRunProp(pxPrev, px: Int, oldIndex, index: Array[Int],
  bytes: Array[Byte], outPos1, outPos2: Long): Boolean =
  // ... preconditions including e.g. ordering on outPos1, outPos2
  doDecodeNext(bytes, oldIndex, pxPrev, outPos1) match
    case (DiffOrIndexOrColor(decodedPx), inPosRes) =>
      decodedPx == px && inPosRes == outPos2 &&
      oldIndex.updated(colorPos(px), px) == index
    case _ => false


We rely on Inox (Stainless’ underlying solver) to unfold 
function definitions to prove that the calls to updateRunProp 
and encodeNoRunProp in encodeSingleStep hold. To help with 
the proof, we also provide assertions whose content is similar 
to the properties stated by updateRunProp and encodeNoRunProp. 

Now that we have these two invertibility properties, we 
show that the composition of these two phases preserves 
invertibility by tying all facts together (see the end of the body 
of encodeSingleStep in the source code of encoder.scala). 


IV. RESULTS 


We first present some statistics and remarks about the ver- 
ification before considering the performance of the generated 
C code with respect to the reference implementation. 

For all experiments, we used a server with 2× Intel® Xeon®
E5-2680 v2 CPUs at 2.80 GHz (release date Q3'13, for a total
of 20 physical cores) running Ubuntu 20.04.3 LTS.


A. Verification Statistics 


Our QOI implementation in Scala without annotations consists
of 313 lines of code (LOC)⁴. The annotated version has
2789 LOC, of which 1405 are for lemmas and helpers. This 
yields a ratio of 8.9 lines of specifications per executable line. 
The specification lines include 42 lemmas, 19 of which are 
general purpose and could become part of the standard library. 

Table II shows for each category of verification condition 
(VC) their respective numbers and their cumulative times. It 
took roughly 1h30min to verify all VCs. The lower quartile, 
the median, and the upper quartile are 0.5s, 1.8s, and 5.7s 
respectively. Around 9.5% of VCs took more than 30s to 
verify, the highest being 3min. 

For each function call, Stainless generates VCs corresponding
to the function preconditions. Assertion annotations and
postconditions of functions are translated into VCs as well.


⁴Counted with cloc v1.82.

TABLE II
SUMMARY OF THE VERIFICATION CONDITIONS.

Verification Condition    | #    | Total time [min]
Preconditions             | 2387 | 370.9
Body assertions           | 787  | 203.3
Postconditions            | 145  | 31.2
Array index within bounds | 126  | 4.9
Remainder by zero         | 87   | 10.6
Non-negative measure      | 23   | 2.1
Class invariant           | 21   | 1.5
Cast correctness          | 6    | 0.1
Match exhaustiveness      | 5    | 0.4
Measure decreases         | 4    | 4.4
Total                     | 3591 | 629.4


Stainless furthermore generates other runtime safety verifica- 
tion conditions, such as array bounds checks and remainder 
by zero checks. It is sometimes necessary to provide sufficient 
annotations (e.g., assertions and invariants) to help Stainless 
prove these VCs. 


B. Verification Effort 


The case study was implemented and formally verified by
the first author (who had a few months of experience with
Stainless) over a period of approximately 4 to 5 weeks.

We first implemented a version closely following the
C reference version. Though we could prove runtime safety,
describing deeper properties turned out to be difficult. For
example, we could not refer to the result of decoding a range,
but only to the end-to-end decompression result of the entire
image.

We thus rewrote the implementation multiple times,
making both small and larger changes. Since the encoder
and decoder are succinct, the rewrites took a relatively small
amount of time compared to the remaining verification effort.

During repeated verification runs, the VC cache and the 
ability to selectively verify only provided functions greatly 
speed up the interactive experience. For example, making a 
few changes to a previously verified version requires less than 
two minutes to check all VCs, compared to the 1h30min for 
a clean-state re-run. 


C. Generated C Code and Its Efficiency 


We compare the encoding and decoding throughput of the 
transpiled C code with the reference implementation. Though 
the primary goal of the reference is simplicity, its decoding 
and encoding throughput are respectively 3.4x and 29x higher 
than libpng while achieving a similar compression ratio⁵.

As mentioned in Section III, we make use of ghost state
for proving invertibility. Stainless first checks for correct usage
of ghost variables before eliminating them in a phase of the C
transpiler. Assertions and function contracts are removed as
well⁶. In summary, “proof infrastructure” is erased and incurs
no cost at runtime.


⁵Derived from the section “Grand total for images (AVG)” at
https://qoiformat.org/benchmark/ (consulted on 11.08.2022).
⁶To ensure removal, developers should import the StaticChecks library.


The generated C code is 661 LOC long, against 311 for the
reference implementation. For the purpose of evaluation, we
also wrote unverified glue C code that performs I/O. We do
not make any correctness claims about this code, only about
the part that converts arrays of bytes between uncompressed
and compressed form. We evaluated the throughput of the
generated C code (genc-qoi) against the reference
implementation (qoi) using a modified version of the benchmark
utility shipped with qoi. We ran the benchmark with 3 runs
over 7 images ranging from 3 to 13.8 megapixels, and report
the results in Table III.

We compiled all involved C sources using GCC 11.1.0
with -O3. As our implementation uses tail recursion, so
does the generated C code⁷. It is necessary to pass an
optimization level of at least -O2 or explicitly pass
-foptimize-sibling-calls to GCC in order to have
the tail calls eliminated.

To our surprise, the transpiled version is on par with the
reference implementation: it is approximately 7% faster in
decoding and 2% slower in encoding. Disassembling the
decoding functions reveals that both were compiled similarly.
Nevertheless, the genc-qoi version uses more instructions
for all cases but index decoding (case B). These extra
instructions are of an arithmetic and logical nature and do not
involve memory operations. For case B, GCC produced one 4-byte
memory load operation for genc-qoi, while it emitted four
1-byte memory load operations for qoi. We conjecture that
the reported difference may be explained by these three extra
memory loads.


TABLE III
BENCHMARK RESULTS OF QOI AND GENC-QOI

Implementation      | Decoding throughput [megapixels/s] | Encoding throughput [megapixels/s]
qoi (unverified)    | 90.92                              | 86.24
genc-qoi (verified) | 97.65                              | 84.45
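
The percentages quoted in the text follow from the table entries. A small sketch of the arithmetic, with hypothetical names, is:

```scala
object ThroughputSketch {
  // Relative change of the generated code with respect to the reference,
  // in percent (positive means the generated code is faster).
  def relativeChangePct(reference: Double, generated: Double): Double =
    (generated / reference - 1.0) * 100.0
}
```

Applied to the table, relativeChangePct(90.92, 97.65) is about +7.4 for decoding and relativeChangePct(86.24, 84.45) is about -2.1 for encoding, in line with the rounded figures of 7% and 2%.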


V. CONCLUSIONS 


We have presented a QOI implementation in Scala and 
verified with Stainless that decoding is the inverse of encoding. 
We have also seen that the transpiled C version matches the 
performance of the reference implementation. Going forward, 
we expect that other verified implementations will emerge 
and that QOI will become a useful benchmark for testing 
verification approaches and tools. 


ACKNOWLEDGMENT 


We thank the FMCAD 2022 reviewers for helpful comments.
We thank Georg S. Schmid for useful discussions and Jad
Hamza for developing the C code generator in Stainless. We
thank the organizers of the ASPLOS 2022 conference for the
opportunity to present a summary of the case study as
part of the tutorial.


⁷ We thank GCC! Our C code generator does not (yet) eliminate tail calls. 




REFERENCES 

[1] A. Kanade, R. Alur, S. Rajamani, and G. Ramalingam, “Representation 
dependence testing using program inversion,” in Proceedings of the 
Eighteenth ACM SIGSOFT International Symposium on Foundations of 
Software Engineering, ser. FSE '10. New York, NY, USA: Association 
for Computing Machinery, 2010, pp. 277-286. [Online]. Available: 
https://doi.org/10.1145/1882291.1882332 

[2] “The Quite OK Image format for fast, lossless compression.” [Online]. 
Available: https://qoiformat.org/ 

[3] “Verifying programs with Stainless (ASPLOS 2022 tutorial on 
Stainless).” [Online]. Available: https://epfl-lara.github.io/asplos2022tutorial/ 

[4] L. P. Deutsch, “DEFLATE Compressed Data Format Specification 
version 1.3,” Internet Engineering Task Force, Request for Comments 
RFC 1951, May 1996, 17 pages. [Online]. Available: 
https://datatracker.ietf.org/doc/rfc1951 

[5] C.-S. Senjak and M. Hofmann, “An implementation of deflate in Coq,” 
2016. [Online]. Available: https://arxiv.org/abs/1609.01220 

[6] R. Affeldt, J. Garrigue, and T. Saikawa, “Examples of Formal 
Proofs about Data Compression,” in 2018 International Symposium on 
Information Theory and Its Applications (ISITA). Singapore: IEEE, 
Oct. 2018, pp. 633-637. [Online]. Available: 
https://ieeexplore.ieee.org/document/8664276/ 

[7] Q. Ye and B. Delaware, “A verified protocol buffer compiler,” in 
Proceedings of the 8th ACM SIGPLAN International Conference on 
Certified Programs and Proofs, ser. CPP 2019. New York, NY, USA: 
Association for Computing Machinery, 2019, pp. 222-233. [Online]. 
Available: https://doi.org/10.1145/3293880.3294105 

[8] R. Edelmann, “Efficient parsing with derivatives and zippers,” 
Ph.D. dissertation, EPFL, Lausanne, 2021. [Online]. Available: 
http://infoscience.epfl.ch/record/287059 

[9] M. Hofmann, B. Pierce, and D. Wagner, “Symmetric lenses,” in 
Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on 
Principles of Programming Languages, ser. POPL '11. New York, NY, 
USA: Association for Computing Machinery, 2011, pp. 371-384. 

[10] J. Hamza, N. Voirol, and V. Kunčak, “System FR: Formalized 
foundations for the Stainless verifier,” Proc. ACM Program. Lang., no. 
OOPSLA, November 2019. 

[11] V. Kunčak and J. Hamza, “Stainless verification system tutorial,” in 
Formal Methods in Computer Aided Design, FMCAD 2021, New Haven, 
CT, USA, October 19-22, 2021. IEEE, 2021, pp. 2-7. 

[12] “Stainless,” 2022. [Online]. Available: 
https://github.com/epfl-lara/stainless/ 

[13] M. Odersky, L. Spoon, B. Venners, and F. Sommers, Programming in 
Scala (Fifth Edition, Updated for Scala 3.0). Artima Press, 2021. 

[14] J. Hamza, S. Felix, V. Kunčak, I. Nussbaumer, and F. Schramka, “From 
verified Scala to STIX file system embedded code using Stainless,” 
in NASA Formal Methods (NFM), 2022, p. 18. [Online]. Available: 
http://infoscience.epfl.ch/record/292424 

[15] M. Abadi and L. Lamport, “The existence of refinement mappings,” 
in Proceedings of the 3rd Annual Symposium on Logic in 
Computer Science, July 1988, pp. 165-175, LICS 1988 Test of 
Time Award. [Online]. Available: 
https://www.microsoft.com/en-us/research/publication/the-existence-of-refinement-mappings/ 




Formal Methods in Computer-Aided Design 2022 


Split Transition Power Abstraction for Unbounded Safety 


Martin Blicha*‡, Grigory Fedyukovich†, Antti E. J. Hyvärinen* and Natasha Sharygina* 
*Università della Svizzera Italiana, Lugano, Switzerland 
{blichm, hyvaeria, sharygin}@usi.ch 
†Florida State University, Tallahassee, FL, USA 
grigory@cs.fsu.edu 
‡Charles University, Prague, Czech Republic 


Abstract—Transition Power Abstraction (TPA) is a recent sym- 
bolic model checking approach that leverages Craig interpolation 
to create a sequence of symbolic abstractions for transition paths 
that double in length with each new element. This doubling 
abstraction allows the approach to find bugs that require long 
executions much faster than traditional approaches that unfold 
transitions one at a time, but its ability to prove system safety 
is limited. This paper proposes a novel instantiation of the TPA 
approach capable of proving unbounded safety efficiently while 
preserving the unique capability to detect deep counterexamples. 
The idea is to split the transition over-approximations into two 
complementary parts. One part focuses only on reachability 
in a fixed number of steps; the second part complements it by 
summarizing all shorter paths. The resulting split abstractions 
are suitable for discovering safe transition invariants, making 
the SPLIT-TPA approach much more efficient in proving safety 
and even improving the counterexample detection. The approach 
is implemented in the constrained Horn clause solver GOLEM 
and our experimental comparison against state-of-the-art solvers 
shows it to be both competitive and complementary. 


I. INTRODUCTION 


Automated formal verification by means of model checking 
is popular because of the ability to both 1) find error paths 
for unsafe systems, and 2) prove the absence of error paths 
for safe systems. Recent techniques based on Satisfiability 
Modulo Theories (SMT) as well as the continuing improve- 
ments of SMT solvers [1, 12, 16, 27, 35] enable scalable 
applications of model checking to software verification [3]. 
Specifically, the idea of building a safe inductive invariant 
incrementally—pioneered by the hardware model checking 
algorithm IC3/PDR [8, 17]—has been successfully applied in 
several IC3-inspired approaches [10, 11, 18, 24, 29, 30], thus 
improving the capabilities of verification tools significantly. 

Although this progress is undeniably encouraging, model 
checking still suffers from scalability issues associated with 
an exhaustive exploration of a system’s states. For many 
systems, a large set of states needs to be observed to eventually 
detect a counterexample or synthesize an invariant. Multi- 
phase loops [39] often exhibit such behaviour, in particular. 
A recently introduced approach based on Transition Power 
Abstraction (TPA) [5] successfully attacks the first part of the 
problem. It uses abstraction to summarize the reachability of 
an exponentially increasing number of steps. Thus TPA can 
quickly focus on the essential part of the search space and 
not waste time examining short paths that cannot lead to a 
counterexample. Interestingly, the abstractions that enable TPA 
to detect long counterexample paths quickly can also be used 
to prove safety by discovering safe transition invariants [5]. 
However, the required condition that the over-approximating 
relation must be closed under composition with the transition 
relation is rarely satisfied, and the algorithm performs rather 
poorly on safe systems. 

In this paper we leverage the ideas from TPA that enable 
a fast exploration of large parts of the state space to detect 
invariants in the system that hold only after a specific (often 
very large) number of transitions. Our new approach, called 
SPLIT-TPA, also uses the idea of the transition power ab- 
straction sequence but computes the abstractions in a different 
way that generates significantly more suitable candidates for 
transition invariants. In the original TPA sequence, the $n$-th element 
over-approximates reachability in up to $2^n$ steps of the transition 
relation. The TPA sequence is used to check reachability by 
doubling the number of explored states at every iteration 
of the verification run. At the same time the sequence is 
expanded and its elements are refined as a direct consequence 
of information learned in these bounded reachability checks. 

The novelty of SPLIT-TPA lies in splitting the over-approximating 
sequence into two complementary parts: TPA$^{=}$ 
and TPA$^{<}$. Elements of TPA$^{=}$ summarize paths of a fixed 
number of steps: the $n$-th element covers exactly $2^n$ steps of the 
transition relation. The elements of TPA$^{<}$ complement the 
first sequence: the $n$-th element summarizes all paths of length 
less than $2^n$. The abstractions of the TPA$^{=}$ sequence allow SPLIT- 
TPA to discover a special type of safe transition invariants, 
which are not possible to obtain in the original TPA algorithm. 
These invariants are composed of two orthogonal parts: one 
part summarizes safe transitions up to a specific bound; the 
second part summarizes unbounded safety, but only from that 
specific bound onwards. The final invariant is a disjunction of 
these two orthogonal parts which together cover any number 
of transitions. This specific structure makes these invariants 
suitable for proving safety of a large class of problems 
including some challenging instances that cannot be tackled 
by other state-of-the-art approaches. 

We have implemented SPLIT-TPA in our publicly available 
CHC solver GOLEM and compared it against the original 
TPA approach and other state-of-the-art solvers ELDARICA 
and SPACER. On a set of challenging public benchmarks 
representing multi-phase loops [39], SPLIT-TPA significantly 
outperforms TPA on the safe version of these benchmarks and 
is able to prove safe several benchmarks that state-of-the-art 
tools are not able to solve. Moreover, SPLIT-TPA outperforms 
TPA also on the unsafe version of these benchmarks. 

The rest of the paper is organized as follows. Section II 
presents the necessary background. Section III gives a detailed 
overview of the TPA algorithm from [5]. Our novel instanti- 
ation is presented in Section IV. In Section V we show how 
the transition invariants from SPLIT-TPA can be translated into 
state invariants. The experiments are described in Section VI. 
Finally, we discuss the related work in Section VII and 
conclude in Section VIII. 


II. PRELIMINARIES 


We assume a finite set of (typed) variables $\bar{x}$, called state 
variables, and we associate with it a primed copy $\bar{x}'$. A 
formula $S(\bar{x})$ over the state variables is a state formula and 
a formula $T(\bar{x}, \bar{x}')$ is a transition formula. A state $s$ is an 
interpretation of $\bar{x}$ that assigns a value to each $x \in \bar{x}$. For a 
formula $S(\bar{x})$ and a state $s$ we say $s$ is an $S$-state iff $s \models S$. 
We identify state formulas with sets of states where they hold 
and freely move between these two representations. Similarly, 
we identify transition formulas with binary relations over the 
set of states. The identity relation $Id(x, x')$ corresponds to the 
transition formula $x = x'$. For readability we typically drop the 
vector notation and use $x, x'$ instead of $\bar{x}, \bar{x}'$. Additional copies 
of the state variables are denoted as $x''$, $x'''$, or in general $x^{(n)}$ 
for $x$ with $n$ primes added. Given binary relations $R_1$ and 
$R_2$, $R_1 \circ R_2$ represents the relational composition of $R_1$ and $R_2$, 
and $R_1 \cup R_2$ represents their union. For $R = R_1 \circ R_2$, 
$R(x, z) = \exists y : R_1(x, y) \land R_2(y, z)$. Similarly, for $R = R_1 \cup R_2$, 
$R(x, y) = R_1(x, y) \lor R_2(x, y)$. For a binary relation $R$ and a 
set $A$, we denote the restriction of the domain of $R$ to $A$ as 
$A \lhd R = \{(x, y) \mid (x, y) \in R \text{ and } x \in A\}$ and the restriction 
of the codomain as $R \rhd A = \{(x, y) \mid (x, y) \in R \text{ and } y \in A\}$. In 
terms of logical formulas, $(A \lhd R)(x, y) = R(x, y) \land A(x)$ and 
$(R \rhd A)(x, y) = R(x, y) \land A(y)$. 

A transition system is a pair $S = (Init, Tr)$ where $Init(\bar{x})$ 
defines the initial states and $Tr(\bar{x}, \bar{x}')$ defines the 
transition relation of the system. A safety problem is a triple 
$(Init, Tr, Bad)$ where $(Init, Tr)$ is a transition system and 
$Bad(\bar{x})$ represents error states. The relation $Tr^n$ denotes the 
composition of $n$ copies of the transition relation and represents 
reachability in exactly $n$ steps; $Tr^0 = Id$. 
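
For intuition, the relational operations above can be mirrored on a finite, explicitly enumerated state space. The following Python sketch represents relations as sets of pairs; the helper names (`compose`, `restrict_domain`, `restrict_codomain`, `power`) and the four-state example system are illustrative assumptions and do not appear in the paper, which works with symbolic formulas instead.

```python
def compose(r1, r2):
    """Relational composition: (x, z) such that r1(x, y) and r2(y, z) for some y."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


def restrict_domain(a, r):
    """Restriction of the domain of r to the set a."""
    return {(x, y) for (x, y) in r if x in a}


def restrict_codomain(r, a):
    """Restriction of the codomain of r to the set a."""
    return {(x, y) for (x, y) in r if y in a}


def power(tr, n, states):
    """Composition of n copies of tr; the 0-th power is the identity on states."""
    result = {(s, s) for s in states}
    for _ in range(n):
        result = compose(result, tr)
    return result


# Hypothetical example: a chain 0 -> 1 -> 2 -> 3.
states = {0, 1, 2, 3}
tr = {(0, 1), (1, 2), (2, 3)}

two_steps = power(tr, 2, states)
from_zero = restrict_domain({0}, two_steps)
into_three = restrict_codomain(tr, {3})
```

Here `two_steps` relates each state to the state exactly two transitions later, `from_zero` keeps only the pairs that start in state 0, and `into_three` keeps only the transitions that end in state 3.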

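A set of states $S$ is a $k$-inductive invariant iff 

• $Init(x^{(0)}) \land Tr^i(x^{(0)}, x^{(i)}) \implies S(x^{(i)})$ for $0 \le i < k$, 

• $\bigwedge_{i=0}^{k-1} \left( S(x^{(i)}) \land Tr(x^{(i)}, x^{(i+1)}) \right) \implies S(x^{(k)})$. 

$S$ is an inductive invariant if it is 1-inductive. 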

A binary relation $R$ is a (full) transition invariant iff $R \supseteq Tr^*$, 
where $Tr^*$ is the reflexive transitive closure of $Tr$. We 
say that $R$ is a left-grounded transition invariant iff 
$Init \lhd R \supseteq Init \lhd Tr^*$. Similarly, $R$ is a right-grounded transition 
invariant iff $R \rhd Bad \supseteq Tr^* \rhd Bad$. $R$ is a grounded transition 
invariant if it is either left-grounded or right-grounded. Note 
that a full transition invariant is also both left-grounded and 
right-grounded. We say $R$ is safe iff 
$\forall x, x' : x \in Init \land x' \in Bad \implies (x, x') \notin R$, 
or in other words, $Init(x) \land R(x, x') \land Bad(x')$ is unsatisfiable. 
If a safe grounded transition invariant 
exists, then $Bad$ is not reachable from $Init$, and the system is 
safe. 
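
The definitions of grounded and safe transition invariants can be checked directly on a small explicit-state system. The sketch below is only an illustration under that assumption: the system, the candidate relation, and all function names are hypothetical, and the paper itself checks these conditions symbolically.

```python
def compose(r1, r2):
    """Relational composition of two relations given as sets of pairs."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


def closure(tr, states):
    """Reflexive transitive closure of tr, computed as a least fixed point."""
    result = {(s, s) for s in states}
    while True:
        extended = result | compose(result, tr)
        if extended == result:
            return result
        result = extended


def is_left_grounded_invariant(r, init, tr, states):
    """Does r contain every pair of the closure of tr that starts in init?"""
    reachable_pairs = {(x, y) for (x, y) in closure(tr, states) if x in init}
    return reachable_pairs <= {(x, y) for (x, y) in r if x in init}


def is_safe(r, init, bad):
    """r is safe iff it relates no initial state to a bad state."""
    return not any(x in init and y in bad for (x, y) in r)


# Hypothetical system: 0 and 1 toggle forever; 2 -> 3 is disconnected from Init.
states = {0, 1, 2, 3}
tr = {(0, 1), (1, 0), (2, 3)}
init, bad = {0}, {3}

# Candidate: "the target state is 0 or 1", regardless of the source state.
candidate = {(x, y) for x in states for y in (0, 1)}

grounded = is_left_grounded_invariant(candidate, init, tr, states)
safe = is_safe(candidate, init, bad)
full = closure(tr, states) <= candidate
```

In this example `candidate` is a safe left-grounded transition invariant, which suffices to conclude safety, although it is not a full transition invariant because it misses the pair (2, 3).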

A Craig interpolant [15] for an unsatisfiable $A \land B$ is a 
formula $I$ such that i) $A \implies I$; ii) $I \land B \implies \bot$; iii) $I$ 
uses only common symbols of $A$ and $B$. 


III. AN OVERVIEW OF TPA 


Here we give a brief overview of the TPA algorithm as 
introduced in [5]. The main procedure is given in Algorithm 1 
and resembles the typical main loop of bounded model check- 
ing that checks bounded reachability for gradually increasing 
bound. The main difference is that TPA increases this bound 
in exponential steps (ISREACHABLE($n$, $Init$, $Bad$) checks for 
paths of length $\le 2^{n+1}$), instead of in one-step increments, 
as is typical for bounded model checking algorithms. This 
allows TPA to detect much longer counterexamples compared 
to state-of-the-art competitors, as witnessed in [5]. 


Algorithm 1: ISSAFETPA($(Init, Tr, Bad)$): TPA's main procedure 

input : transition system $S = (Init, Tr, Bad)$ 
global: TPA sequence $ATr^{\le 0}, \ldots, ATr^{\le n}, \ldots$ (lazily 
        initialized to true) 

$ATr^{\le 0} \leftarrow Id \lor Tr$; $n \leftarrow 0$; $res \leftarrow \emptyset$ 
while $res = \emptyset$ do 
|   $res \leftarrow$ ISREACHABLE($n$, $Init$, $Bad$) 
|   $n \leftarrow n + 1$ 
return UNSAFE 


The key ingredient that allows efficient bounded reachability 
checks is the transition power abstraction sequence. It is 
a sequence of relations where the $n$-th element over-approximates 
reachability in up to $2^n$ steps of $Tr$. The construction and 
refinement of the TPA sequence happen as part of the bounded 
reachability check, inside the procedure ISREACHABLE, given 
in Algorithm 2. 


Algorithm 2: ISREACHABLE($n$, $Src$, $Tgt$): Reachability query using TPA sequence 

input : level $n$, source states $Src$, target states $Tgt$ 
output: subset of target states truly reachable in $\le 2^{n+1}$ steps 
global: TPA sequence $ATr^{\le 0}, \ldots, ATr^{\le n}, \ldots$ 
 1 while true do 
 2 |   $q \leftarrow Src(x) \land ATr^{\le n}(x, x') \land ATr^{\le n}(x', x'') \land Tgt(x'')$ 
 3 |   $sat\_res \leftarrow$ CHECKSAT($q$) 
 4 |   if $sat\_res$ = UNSAT then 
 5 |   |   $Itp(x, x'') \leftarrow$ GETITP($ATr^{\le n}(x, x') \land ATr^{\le n}(x', x'')$, 
   |   |                                  $Src(x) \land Tgt(x'')$) 
 6 |   |   $ATr^{\le n+1} \leftarrow ATr^{\le n+1} \land Itp[x'' \mapsto x']$ 
 7 |   |   return $\emptyset$ 
 8 |   else 
 9 |   |   if $n = 0$ then return QE($\exists x, x' . q$)$[x'' \mapsto x]$ 
10 |   |   $Inter \leftarrow$ QE($\exists x, x'' . q$)$[x' \mapsto x]$ 
11 |   |   $InterReach \leftarrow$ ISREACHABLE($n - 1$, $Src$, $Inter$) 
12 |   |   if $InterReach = \emptyset$ then continue 
13 |   |   $TgtReach \leftarrow$ ISREACHABLE($n - 1$, $InterReach$, $Tgt$) 
14 |   |   if $TgtReach \neq \emptyset$ then return $TgtReach$ 

This procedure returns a subset of reachable states of $Tgt$ 
if there exists a path from $Src$ to $Tgt$ of length at most 
$2^{n+1}$. If no such path exists, it returns an empty set. First, 
it checks the existence of an abstract path consisting of two steps 
of $ATr^{\le n}$, the $n$-th element of the TPA sequence (lines 2-3). 
If no such abstract path exists (line 4), then no real path of 
length $\le 2^{n+1}$ exists (line 7). Additionally, the $(n+1)$-th element 
of the TPA sequence is constructed or refined using Craig 
interpolation [15] (lines 5-6). 

If an abstract path does exist, the procedure attempts to 
refine it to a real path. The refinement begins by applying 
quantifier elimination (QE) to determine a set of candidate 
intermediate states (line 10). These are states that can be 
reached from $Src$ by one step of $ATr^{\le n}$ and also can reach 
$Tgt$ by another step of $ATr^{\le n}$. Given a set of intermediate 
states, the procedure recursively determines the existence of a 
real path from $Src$ to the intermediate states (line 11) and 
then the existence of a real path from the truly reachable 
intermediate states to $Tgt$ (line 13). The bound for these 
recursive calls is decremented, and $n = 0$ represents the 
base case where no recursive calls are needed, as $ATr^{\le 0}$ 
represents true reachability in the system (line 9). If any of 
the two abstract steps cannot be refined, the procedure tries to 
find a new abstract path and repeats the whole process. The 
strengthening of $ATr^{\le n}$ in the recursive call to ISREACHABLE 
with $n - 1$ guarantees that refuted abstract paths cannot repeat, 
and the procedure makes progress. 

Note that instead of full quantifier elimination, any under- 
approximation can be used in ISREACHABLE. In particular, 
experiments in [5] showed that TPA works much better with 
model-based projection [4, 30]. 

One way to understand the procedure ISREACHABLE in 
TPA is that it mimics bounded reachability checks using a 
sequence of (precise) relations $R^{\le n}$ defined inductively as 

$$R^{\le 0} = Id \cup Tr, \qquad R^{\le n+1} = R^{\le n} \circ R^{\le n}. \tag{1}$$ 

However, this precise sequence is over-approximated by the 
TPA sequence. The over-approximation keeps the satisfiability 
queries manageable: each $ATr^{\le n}$ is a formula only over 
two copies of the state variables, no matter how large $n$ is. 
This is guaranteed by using Craig interpolation to compute 
the abstractions. Compared to that, representing the relation $R^{\le n}$ 
precisely requires $2^n + 1$ copies of the state variables. 
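
The doubling behaviour of the precise sequence from Eq. (1) can be observed on a small explicit-state system. The following Python sketch is an illustration only: it enumerates relations as sets of pairs, which is exactly what TPA avoids by using interpolation-based abstractions, and the counter system and the name `precise_leq_sequence` are our assumptions.

```python
def compose(r1, r2):
    """Relational composition of two relations given as sets of pairs."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


def precise_leq_sequence(tr, states, levels):
    """Level 0 is Id united with Tr; each further level composes the
    previous level with itself, as in Eq. (1)."""
    identity = {(s, s) for s in states}
    sequence = [identity | set(tr)]
    for _ in range(levels):
        sequence.append(compose(sequence[-1], sequence[-1]))
    return sequence


# Hypothetical system: a counter that steps from i to i + 1, for 0 <= i < 20.
states = set(range(21))
tr = {(i, i + 1) for i in range(20)}

sequence = precise_leq_sequence(tr, states, levels=4)


def within(bound):
    """Pairs (i, j) of counter values with j reachable from i in <= bound steps."""
    return {(i, j) for i in states for j in states if 0 <= j - i <= bound}
```

For this counter, level `n` of `sequence` coincides with `within(2 ** n)`, so only four compositions are needed to cover all paths of up to 16 steps.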

The TPA algorithm has been designed to detect long coun- 
terexample paths quickly and in this has achieved significant 
improvements over the state-of-the-art. Interestingly, the TPA 
sequence can also provide candidates for safe transition 
invariants, which could be used to prove safety. However, the 
capabilities of TPA in this respect are very limited, as also 
exhibited by the experimentation in [5]. 

Fig. 1 illustrates the limitations of TPA in generating safe 
transition invariants. The loop on the left has been studied 
extensively in the context of loop invariants, e.g., in [39]. 
We scaled the constants to better demonstrate the behaviour 
of TPA. TPA proves safety up to $8192 = 2^{13}$ iterations 
of the loop very quickly. Each of the first 13 top-level 
calls to ISREACHABLE determines bounded safety with a 
single satisfiability query. In the process, TPA learns that 
$ATr^{\le n} \equiv x' \le x + 2^n$ for $n = 1 \ldots 13$. It utilizes the fact 
that $x$ must be incremented more than $2^{13}$ times to exit the 
loop and reach the assert. However, in the next iteration 
of Algorithm 1 an abstract path of two steps of $ATr^{\le 13}$ 
is discovered and the refinement process in ISREACHABLE 
begins. To make progress, the algorithm must refine the 
over-approximating relation $ATr^{\le 13}$ in order to show that the error 
is not reachable in two steps of $ATr^{\le 13}$. This requires learning 
a suitable relation between variables $x$ and $y$. However, since 
$ATr^{\le 13}$ must capture all paths of length $\le 2^{13}$, it is not easy 
to learn such a relation. At least in our implementation, TPA 
is continuously discovering and refuting new abstract paths, 
making very little progress in refining the elements of the TPA 
sequence with each refutation. Due to this slow progress, the 
algorithm fails to prove safety in a reasonable amount of time. 

    x=0; y=5000;              v=0; w=0; 
    while (x<10000) {         assume (x>z); 
      if (x>=5000)            while (v<1000) { 
        y=y+1;                  if (x<z) 
      x=x+1;                      v=v+1; 
    }                           else 
    assert (y==10000);            w=w+1; 
                                x=x+1; 
                                z=z+2; 
                              } 
                              assert (w>0); 

Fig. 1. Examples of multi-phase loops 
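
To make the two phases of the first loop concrete, it can simply be executed. The Python transcription below is a sketch for illustration (the function name `run_first_loop` is ours); it shows that the assertion is reached only after 10000 iterations, which lies strictly between $2^{13}$ and $2^{14}$, and that `y` starts changing only in the second half of the run.

```python
def run_first_loop():
    """Direct transcription of the left loop of Fig. 1."""
    x, y = 0, 5000
    iterations = 0
    first_change_of_y = None  # iteration index at which y is first incremented
    while x < 10000:
        if x >= 5000:
            if first_change_of_y is None:
                first_change_of_y = iterations
            y = y + 1
        x = x + 1
        iterations += 1
    return x, y, iterations, first_change_of_y


x, y, iterations, first_change_of_y = run_first_loop()
print(x, y, iterations, first_change_of_y)
```

Any bounded proof for at most $2^{13}$ steps can therefore ignore `y` and argue about `x` alone, whereas a proof covering the loop exit has to relate `x` and `y`, which matches the difficulty described above.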

The second loop depicted on the right of Fig. 1 is benchmark 
17 from the suite of multi-phase benchmarks used in our 
experiments (Section VI). The behaviour of TPA is similar to 
the previous case, but this time it can find a safe invariant, 
though at a considerable cost, as illustrated below. It uses 
variable v and the fact that at least 1000 increments are 
required and quickly proves bounded safety up to 2° iterations 
of the loop. In the next iteration of its main procedure TPA 
spends a considerable amount of time in ISREACHABLE 
refining the abstraction and capturing the behaviour of the 
other variables and the relations between them. Finally, after 
proving safety up to 2! iterations of the loop, it manages to 
discover a safe transition invariant. 

We will see in the next section that SPLIT-TPA is able 
to prove the first loop safe and it can find a safe transition 
invariant for the second loop much faster. 


IV. SPLIT TRANSITION POWER ABSTRACTION 


In this section we present SPLIT-TPA, a new instantiation of 
the TPA approach suitable for proving unbounded safety. We 
start by revisiting $R^{\le}$ from Eq. (1) and show that the idea of 
splitting the TPA sequence arises naturally from a redundancy 
present in the inductive definition of $R^{\le}$. Then we show how 
SPLIT-TPA performs bounded reachability checks with the 
split sequences and how it discovers safe transition invariants. 

A. Overview 


As mentioned previously, the TPA algorithm has been 
designed to be a simple and efficient procedure for detecting 
deep counterexample paths. It can also prove safety by dis- 
covering a safe transition invariant for the system. However, 
the only source of candidates for the required safe transition 
invariants are the elements $ATr^{\le n}$ of the TPA sequence. 
$ATr^{\le n}$ can be proved to be a transition invariant if it is 
closed under composition with one step of $Tr$. The problem is 
that this condition is rarely fulfilled. The abstractions $ATr^{\le n}$ 
are primarily constructed as proofs of bounded safety in the 
system: they must summarize all paths of lengths from 0 to $2^n$ 
and they must be safe. While it is possible that such a bounded 
proof is in fact an unbounded proof, in many cases these 
abstractions are not closed under composition with $Tr$, and 
the bounded proofs do not generalize to unbounded proofs. 

Our solution to TPA's lack of ability to prove unbounded 
safety in practice is to introduce a new source of candidates 
for transition invariants. We split the over-approximating TPA 
sequence into two complementary parts: TPA$^{=}$ and TPA$^{<}$. 
Elements of TPA$^{=}$ summarize paths of fixed length and the 
corresponding elements of TPA$^{<}$ summarize all shorter paths. 
While TPA$^{<}$ leads to similar transition invariants as TPA, 
TPA$^{=}$ leads to invariants with a different structure and different 
properties, which allows SPLIT-TPA to prove safety of some 
challenging problems. 

The idea of splitting is motivated not only by the need 
for another source of candidates for invariants, but also by a 
possible redundancy in the TPA algorithm, which could lead 
to unnecessary work. The TPA sequence is based on the sequence 
$R^{\le}$ from Eq. (1). The intuition behind this inductive definition 
is that every path of length $\le 2^{n+1}$ can be obtained as a 
concatenation of two paths of length $\le 2^n$. However, there 
can be multiple ways to decompose such a path into two 
smaller paths (see Fig. 2) and proving one such decomposition 
infeasible does not entail that others are infeasible as well. 


Fig. 2. Three different ways of decomposing path of length 6 into two paths 
of length at most 4 


Splitting arises naturally from an attempt to fix this redundancy. 
The reasoning is as follows: Instead of concatenating 
two steps of $R^{\le n}$ to obtain $R^{\le n+1}$, we replace one of these 
steps with a step of $R^{=n} = Tr^{2^n}$, which represents reachability 
in exactly $2^n$ steps. However, $R^{\le n} \circ R^{=n}$ covers only 
paths of length from $2^n$ to $2^{n+1}$. To keep the smaller lengths 
covered as well, we can add $R^{\le n}$. The result, 
$R^{\le n+1} = R^{\le n} \cup R^{\le n} \circ R^{=n}$, almost gives us the unique deconstruction 
we are seeking. The exceptions are paths of length exactly $2^n$, 
which are covered by both $R^{\le n}$ and $R^{\le n} \circ R^{=n}$. The final 
step is a realization that this last redundancy is removed by 
replacing the relation $R^{\le n}$ by $R^{<n}$. The sequence $R^{<}$ has the 
following inductive definition:¹ 

$$R^{<0} = Id, \qquad R^{<n+1} = R^{<n} \cup R^{<n} \circ R^{=n}, \tag{2}$$ 

with the sequence $R^{=}$ also defined inductively: 

$$R^{=0} = Tr, \qquad R^{=n+1} = R^{=n} \circ R^{=n}. \tag{3}$$ 


Notice that we have effectively split the $R^{\le}$ sequence into 
two sequences $R^{<}$ and $R^{=}$, because $R^{\le n} = R^{<n} \cup R^{=n}$. Now, 
decomposing a path according to the inductive definitions from 
Eq. (2) and (3) is unique. For example, there is only one way 
to decompose the path of length six from Fig. 2, now viewed 
as one step of $R^{<3}$, according to Eq. (2): the first two steps are 
covered by $R^{<2}$ and the last four steps are covered by $R^{=2}$. 
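
The difference between the two decompositions can be made concrete by counting. The Python sketch below is illustrative only, and the function names `splits_leq` and `decompose_split` are ours: the first enumerates the ways to cut a path into two parts under the definition from Eq. (1), the second follows Eq. (2) and returns the unique sequence of exact blocks.

```python
def splits_leq(d: int, n: int) -> list[tuple[int, int]]:
    """All ways to cut a path of length d into two parts of length <= 2**n,
    as allowed by the inductive step of Eq. (1)."""
    bound = 2 ** n
    return [(a, d - a) for a in range(d + 1) if a <= bound and d - a <= bound]


def decompose_split(d: int, n: int) -> list[int]:
    """Levels k of the exact blocks (each of length 2**k), in path order,
    that Eq. (2) assigns to a path of length d < 2**n."""
    if not 0 <= d < 2 ** n:
        raise ValueError("path length must satisfy 0 <= d < 2**n")
    if n == 0:
        return []  # only the empty path is shorter than one step
    half = 2 ** (n - 1)
    if d < half:
        # The path is already covered by the previous "shorter than" level.
        return decompose_split(d, n - 1)
    # Otherwise: a shorter prefix followed by one exact block of length 2**(n-1).
    return decompose_split(d - half, n - 1) + [n - 1]


print(splits_leq(6, 2))       # the three decompositions of Fig. 2
print(decompose_split(6, 3))  # the single decomposition under Eq. (2)
```

For the path of length six, `splits_leq` yields the three cuts (2, 4), (3, 3) and (4, 2), while `decompose_split` yields one block of length 2 followed by one block of length 4.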

Following the TPA template, we do not use the sequences 
$R^{<}$ and $R^{=}$ directly. We build over-approximating sequences 
TPA$^{<}$ and TPA$^{=}$ whose representation in terms of copies 
of state variables does not blow up with increasing $n$. The 
elements of the over-approximating sequences TPA$^{<}$ and 
TPA$^{=}$ are denoted as $ATr^{<n}$ and $ATr^{=n}$, respectively, and 
we require that 

$$ATr^{<n} \supseteq R^{<n} = Id \cup Tr \cup Tr^2 \cup \cdots \cup Tr^{2^n - 1}, \tag{4}$$ 
$$ATr^{=n} \supseteq R^{=n} = Tr^{2^n}. \tag{5}$$ 
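
The closed forms on the right-hand sides of Eq. (4) and (5) can be checked against the inductive definitions on an explicit-state example. This is a sketch under our own assumptions (a hypothetical counter system, relations as sets of pairs, the helper name `split_sequences`); it computes the precise sequences, not the interpolation-based abstractions used by SPLIT-TPA.

```python
def compose(r1, r2):
    """Relational composition of two relations given as sets of pairs."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}


def split_sequences(tr, states, levels):
    """Precise 'shorter than' and 'exactly' sequences following Eq. (2) and (3)."""
    lt = [{(s, s) for s in states}]  # level 0: identity
    eq = [set(tr)]                   # level 0: one step of tr
    for _ in range(levels):
        lt.append(lt[-1] | compose(lt[-1], eq[-1]))
        eq.append(compose(eq[-1], eq[-1]))
    return lt, eq


# Hypothetical system: a counter that steps from i to i + 1, for 0 <= i < 20.
states = set(range(21))
tr = {(i, i + 1) for i in range(20)}
lt, eq = split_sequences(tr, states, levels=4)


def distance(predicate):
    """Pairs of counter values whose difference satisfies the predicate."""
    return {(i, j) for i in states for j in states if predicate(j - i)}
```

For this counter, `lt[n]` relates values at distance 0 to `2**n - 1`, `eq[n]` relates values at distance exactly `2**n`, and their union is the level-`n` relation of Eq. (1).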


SPLIT-TPA uses these over-approximating sequences TPA$^{<}$ 
and TPA$^{=}$ both for bounded reachability checks and for 
detecting safe transition invariants. We will see later that the TPA$^{=}$ 
sequence allows SPLIT-TPA to find interesting invariants and 
prove safety of challenging problems. The main procedure 
of SPLIT-TPA is similar to Algorithm 1 and is given in 
Algorithm 3. 


Algorithm 3: ISSAFESPLITTPA($(Init, Tr, Bad)$): SPLIT-TPA's main procedure 

input : transition system $S = (Init, Tr, Bad)$ 
global: TPA$^{<}$ sequence $ATr^{<0}, \ldots, ATr^{<n}, \ldots$ 
        TPA$^{=}$ sequence $ATr^{=0}, \ldots, ATr^{=n}, \ldots$ (lazily 
        initialized to true) 

$ATr^{<0} \leftarrow Id$; $ATr^{=0} \leftarrow Tr$; $n \leftarrow 0$ 
while true do 
|   if ISREACHABLELT($n$, $Init$, $Bad$) $\neq \emptyset$ or 
|      ISREACHABLEEQ($n$, $Init$, $Bad$) $\neq \emptyset$ then return UNSAFE 
|   if HASTRANSITIONINVARIANT($S$, $n$) then return SAFE 
|   $n \leftarrow n + 1$ 


In the rest of this section we present the implementa- 
tion of the methods ISREACHABLELT and ISREACHABLEEQ 
for bounded reachability checks and the implementation of 
the method HASTRANSITIONINVARIANT for discovering safe 
transition invariant. 


¹ An alternative inductive definition $R^{<n+1} = R^{<n} \cup R^{=n} \circ R^{<n}$ leads 
to a different variant of our algorithm. 

B. Bounded reachability checks with TPA$^{=}$ and TPA$^{<}$ 


SPLIT-TPA performs the bounded reachability check at level 
$n$ in two phases. First, all paths of length strictly smaller 
than $2^{n+1}$ are checked in ISREACHABLELT. Then all paths 
of length exactly $2^{n+1}$ are checked in ISREACHABLEEQ. 

To implement ISREACHABLEEQ, we can reuse Algorithm 2, 
with the modification that all references to the TPA 
sequence and its elements $ATr^{\le n}$ are replaced by the TPA$^{=}$ 
sequence and its elements $ATr^{=n}$ (we do not repeat the 
pseudocode for the sake of space). To understand why this 
works, compare the inductive definitions of the underlying 
sequences $R^{\le}$ and $R^{=}$ from Eq. (1) and (3). The induction step 
is the same in both cases. The only difference is the base case: 
the TPA$^{=}$ sequence starts with $ATr^{=0} = R^{=0} = Tr$, as opposed 
to $ATr^{\le 0} = R^{\le 0} = Id \cup Tr$. The output of ISREACHABLEEQ 
is either a non-empty subset of $Tgt$ that is truly reachable from 
$Src$ in exactly $2^{n+1}$ steps of $Tr$, or an empty set if no path 
from $Src$ to $Tgt$ of length $2^{n+1}$ exists. 

The procedure ISREACHABLELT is designed to complement 
ISREACHABLEEQ by covering all paths with $< 2^{n+1}$ 
steps. The implementation is given in Algorithm 4. It follows 
the inductive definition of $R^{<}$ from Eq. (2) in the same manner 
as ISREACHABLEEQ follows the inductive definition of $R^{=}$. 


Algorithm 4: ISREACHABLELT($n$, $Src$, $Tgt$): Reachability query using TPA$^{<}$ sequence 

input : level $n$, source states $Src$, target states $Tgt$ 
output: subset of target states truly reachable in $< 2^{n+1}$ steps 
global: TPA$^{<}$ sequence $ATr^{<0}, \ldots, ATr^{<n}, \ldots$, 
        TPA$^{=}$ sequence $ATr^{=0}, \ldots, ATr^{=n}, \ldots$ 
 1 while true do 
 2 |   $opt1 \leftarrow ATr^{<n}[x' \mapsto x'']$ 
 3 |   $opt2 \leftarrow ATr^{<n}(x, x') \land ATr^{=n}(x', x'')$ 
 4 |   $q \leftarrow Src(x) \land (opt1 \lor opt2) \land Tgt(x'')$ 
 5 |   $sat\_res, model \leftarrow$ CHECKSAT($q$) 
 6 |   if $sat\_res$ = UNSAT then 
 7 |   |   $Itp(x, x'') \leftarrow$ GETITP($opt1 \lor opt2$, $Src(x) \land Tgt(x'')$) 
 8 |   |   $ATr^{<n+1} \leftarrow ATr^{<n+1} \land Itp[x'' \mapsto x']$ 
 9 |   |   return $\emptyset$ 
10 |   else 
11 |   |   if $n = 0$ then return QE($\exists x, x' : q$)$[x'' \mapsto x]$ 
12 |   |   if $model \models opt1$ then 
13 |   |   |   $TgtReach \leftarrow$ ISREACHABLELT($n - 1$, $Src$, $Tgt$) 
14 |   |   |   if $TgtReach = \emptyset$ then continue 
15 |   |   |   return $TgtReach$ 
16 |   |   else 
17 |   |   |   $Inter \leftarrow$ QE($\exists x, x'' : Src(x) \land opt2 \land Tgt(x'')$)$[x' \mapsto x]$ 
18 |   |   |   $InterReach \leftarrow$ ISREACHABLELT($n - 1$, $Src$, $Inter$) 
19 |   |   |   if $InterReach = \emptyset$ then continue 
20 |   |   |   $TgtReach \leftarrow$ ISREACHABLEEQ($n - 1$, $InterReach$, $Tgt$) 
21 |   |   |   if $TgtReach = \emptyset$ then continue 
22 |   |   |   return $TgtReach$ 


ISREACHABLELT first assembles the query for an abstract 
path (lines 2-4) and sends it to the satisfiability solver (line 5). 
Following the inductive definition of Eq. (2), the abstract path 
consists of either one step of $ATr^{<n}$ or a step of $ATr^{<n}$ 
followed by a step of $ATr^{=n}$. If no such abstract path exists 
(line 6), the procedure reports that no real path of length 
$< 2^{n+1}$ exists (line 9). Before reporting the result, it uses Craig 
interpolation [15] to refine the abstraction at the next level 
(line 8). 

If an abstract path exists (line 10), the procedure checks 
whether there is a corresponding real path. On level 0 (line 11), 
the discovered abstract path is real, and the procedure returns a 
reachable subset of target states. On other levels, the procedure 
first needs to determine which abstract path has been found 
and then try to refine it. 

The first possibility is that the abstract path is a single step 
of $ATr^{<n}$ (line 12). The refinement of this single abstract 
step is checked with a single recursive call. If the refinement 
is not successful, the procedure attempts to find a new abstract 
path (line 14). Otherwise, the reached target states from the 
recursive call are returned (line 15). 

The second possibility is that the abstract path consists of one 
step of $ATr^{<n}$ followed by one step of $ATr^{=n}$ (line 16). 
One after another, the procedure attempts to refine these 
abstract steps into a real path by calling the corresponding 
procedures ISREACHABLELT and ISREACHABLEEQ with a 
decreased bound. If any of the two steps cannot be refined, that 
abstract path has been refuted and the procedure attempts to 
find a new abstract path (lines 19, 21). If both abstract steps 
have been successfully refined, a reachable subset of target 
states is reported (line 22). 

Similarly to Algorithm 2, quantifier elimination can be 
replaced by its under-approximation, such as model-based 
projection, and we do so in our implementation. 

The correctness of the reachability procedures guarantees 
the correctness of UNSAFE answer of SPLIT-TPA. 

Lemma 1: If ISREACHABLEEQ($n$, $Src$, $Tgt$) or 
ISREACHABLELT($n$, $Src$, $Tgt$) returns a non-empty set $Res$, then $Res \subseteq Tgt$ 
and every state in $Res$ can be reached from some state in 
$Src$ in exactly $2^{n+1}$ steps (for ISREACHABLEEQ) or in $< 2^{n+1}$ 
steps (for ISREACHABLELT). 

Proof: By induction on $n$, relying on the properties of 
quantifier elimination (QE) and the fact that $ATr^{<0} = Id$ and 
$ATr^{=0} = Tr$ represent true reachability. ∎ 

Theorem 1: If SPLIT-TPA (Algorithm 3) returns UNSAFE, 
then there exists a counterexample path in the system, i.e., 
some bad state is reachable from some initial state. 

Proof: Follows directly from Lemma 1. ∎ 


C. Proving safety by discovering safe transition invariants 


If bounded safety has been proved on level $n$ in Algorithm 3, 
i.e., there is no counterexample path of length 
$\le 2^{n+1}$ in the system, then the algorithm attempts to extend 
the bounded proofs to unbounded ones. The procedure 
HASTRANSITIONINVARIANT tries to construct a (grounded) 
transition invariant based on the elements of the TPA$^{=}$ and TPA$^{<}$ 
sequences. If a safe transition invariant is found, SPLIT-TPA 
has proven unbounded safety. 

We have identified sufficient conditions for the elements 
$ATr^{<n}$ and $ATr^{=n}$ that guarantee the existence of a transition 
invariant. These conditions are formalized in Lemma 2 and 
Lemma 3, respectively. 

Lemma 2: Assume that for some $n$, 
$Init \lhd ATr^{<n} \circ Tr \subseteq Init \lhd ATr^{<n}$. 
Then $ATr^{<n}$ is a left-grounded transition invariant. 

If $Tr \circ ATr^{<n} \rhd Bad \subseteq ATr^{<n} \rhd Bad$, then $ATr^{<n}$ is a 
right-grounded transition invariant. 

Proof: Suppose that $s \in Init$ and $(s,t) \in Tr^{*}$, i.e., t is 
reachable from s. We show that $(s,t) \in ATr^{<n}$ by induction 
on d, the length of a shortest path from s to t. 

Base case $d < 2^{n}$: $(s,t) \in ATr^{<n}$ holds by Eq. (4). 

Induction step: Suppose that the claim holds for all paths 
of length d. We show that it then also holds for all paths of 
length d + 1. Consider a path between s and t of length d + 1. 
Then t has a predecessor m on this path, i.e., m lies d steps 
from s and reaches t in 1 step. Then $(s,m) \in ATr^{<n}$ by the 
induction hypothesis. Since $(m,t) \in Tr$, it follows that 
$(s,t) \in ATr^{<n} \circ Tr$. Since $s \in Init$, it follows by the 
assumption of the lemma that $(s,t) \in ATr^{<n}$. 

We have shown that if $s \in Init$ and $(s,t) \in Tr^{*}$, then 
$(s,t) \in ATr^{<n}$. Thus $ATr^{<n}$ is a left-grounded transition 
invariant. The case of the right-grounded transition invariant is 
analogous: in the inductive case, we take m to be the successor 
of s on the path from s to t. ∎ 

Note that with a slightly stronger assumption we can use 
the same proof idea to discover full transition invariants: 

Observation 1: If $ATr^{<n} \circ Tr \subseteq ATr^{<n}$ or 
$Tr \circ ATr^{<n} \subseteq ATr^{<n}$, then $ATr^{<n}$ is a transition invariant. 
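
The conditions of Lemma 2 and Observation 1 can be tried out on a small explicit-state system. The following is a minimal sketch, not the GOLEM implementation: the toy counter system, the hand-written candidate `atr_lt` standing in for $ATr^{<1}$, and all helper names are assumptions made for illustration, and relations are plain Python sets of pairs instead of SMT formulas.

```python
def compose(r, s):
    """Relational composition: a step of r followed by a step of s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def restrict_left(init, r):
    """Init <| r: keep only the pairs whose source state is initial."""
    return {(a, b) for (a, b) in r if a in init}

def reachability(tr, states):
    """Reflexive-transitive closure Tr* by fixed-point iteration."""
    closure = {(s, s) for s in states}
    while True:
        bigger = closure | compose(closure, tr)
        if bigger == closure:
            return closure
        closure = bigger

# Hypothetical toy system: a counter that stops at 4.
states = set(range(8))
init, bad = {0}, {6}
tr = {(x, x + 1) for x in states if x < 4}
identity = {(x, x) for x in states}

# Hand-written stand-in for ATr^{<1}: it contains Id and Tr (all paths of
# length < 2) plus a coarse summary of what the initial state can reach.
atr_lt = identity | tr | {(0, y) for y in range(5)}

# Premise of Lemma 2 (left-grounded case).
lemma2_premise = restrict_left(init, compose(atr_lt, tr)) <= restrict_left(init, atr_lt)
# Stronger premise of Observation 1 (full transition invariant).
observation1_premise = compose(atr_lt, tr) <= atr_lt
# What Lemma 2 promises: every pair of Tr* starting in Init is covered.
left_grounded = restrict_left(init, reachability(tr, states)) <= restrict_left(init, atr_lt)
# Safety of the invariant: it relates no initial state to a bad state.
safe = not any(s in init and t in bad for (s, t) in atr_lt)
```

In this example the candidate satisfies the grounded premise but not the stronger one, because the pair (1, 3) is reachable in two steps yet missing from `atr_lt`; this illustrates why grounded invariants can be easier to find than full ones.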

Discovering transition invariants based on the $TPA^{<}$ sequence 
is similar to how invariants were detected in the TPA sequence 
in [5]. This is not surprising, as the elements $ATr^{<n}$ and 
$ATr^{=n}$ have similar properties. The key advantage of SPLIT-TPA 
is the additional ability to discover transition invariants 
by detecting fixed points in the $TPA^{=}$ sequence. 

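Lemma 3: Assume that for some n, 
$Init \triangleleft ATr^{<n} \circ ATr^{=n} \circ ATr^{=n} \subseteq Init \triangleleft ATr^{<n} \circ ATr^{=n}$; 
then $Init \triangleleft Tr^{*} \subseteq Init \triangleleft ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$. 

If $ATr^{=n} \circ ATr^{=n} \circ ATr^{<n} \triangleright Bad \subseteq ATr^{=n} \circ ATr^{<n} \triangleright Bad$, 
then $Tr^{*} \triangleright Bad \subseteq ATr^{<n} \cup ATr^{=n} \circ ATr^{<n} \triangleright Bad$. 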

Proof: The proof uses the same ideas as the proof of 
Lemma 2. Suppose that $s \in Init$ and $(s,t) \in Tr^{*}$, i.e., t is 
reachable from s. We proceed by induction on d, the length 
of a shortest path from s to t. 

Base case $d < 2^{n+1}$: It follows by Eq. (4) and (5) that 
$(s,t) \in ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$. 

Induction step: Assuming the claim holds for all paths of 
length d, we show that it also holds for all paths of length 
$d + 2^{n}$. Consider a path between s and t of length $d + 2^{n}$. 
There exists m on this path that lies d steps from s and reaches 
t in exactly $2^{n}$ steps. Then $(m,t) \in ATr^{=n}$ by Eq. (5) and 
$(s,m) \in ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$ by the induction hypothesis. 
Consider the two cases: 

• $(s,m) \in ATr^{<n}$: It follows that $(s,t) \in ATr^{<n} \circ ATr^{=n}$. 
• $(s,m) \in ATr^{<n} \circ ATr^{=n}$: It follows that 
  $(s,t) \in ATr^{<n} \circ ATr^{=n} \circ ATr^{=n}$. Then 
  $(s,t) \in ATr^{<n} \circ ATr^{=n}$ by the assumption of the lemma. 

We have shown that if $s \in Init$ and $(s,t) \in Tr^{*}$ then 
$(s,t) \in ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$. Thus 
$ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$ is a left-grounded transition 
invariant. For the right-grounded transition invariant, in the 
induction step pick m that lies exactly $2^{n}$ steps from s (and 
reaches Bad in d steps). ∎ 

Similarly to Lemma 2, full transition invariants can be 
discovered by checking a stronger condition: 

Observation 2: If $ATr^{=n} \circ ATr^{=n} \subseteq ATr^{=n}$, then both 
$ATr^{<n} \cup ATr^{<n} \circ ATr^{=n}$ and $ATr^{<n} \cup ATr^{=n} \circ ATr^{<n}$ are 
full transition invariants. 
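
The closure condition of Observation 2 can be illustrated in the same explicit-state style. This is a hedged sketch under assumptions: the toy system (a counter that grows by 2), the hand-written abstractions `atr_lt` and `atr_eq` standing in for $ATr^{<1}$ and $ATr^{=1}$, and the helper names are all hypothetical and are not taken from the paper.

```python
def compose(r, s):
    """Relational composition: a step of r followed by a step of s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

def reachability(tr, states):
    """Reflexive-transitive closure Tr* by fixed-point iteration."""
    closure = {(s, s) for s in states}
    while True:
        bigger = closure | compose(closure, tr)
        if bigger == closure:
            return closure
        closure = bigger

# Hypothetical toy system: x grows by 2 while it stays below 16.
states = set(range(16))
init, bad = {0}, {7}
tr = {(x, x + 2) for x in states if x + 2 in states}
identity = {(x, x) for x in states}

# Stand-ins for level n = 1:
# atr_lt covers all paths of length < 2 (here exactly Id and Tr);
# atr_eq over-approximates exactly 2 steps by "x grows by an even amount >= 4".
atr_lt = identity | tr
atr_eq = {(x, y) for x in states for y in states
          if y - x >= 4 and (y - x) % 2 == 0}

over_approximates_two_steps = compose(tr, tr) <= atr_eq
closed_under_composition = compose(atr_eq, atr_eq) <= atr_eq   # premise of Observation 2

# The two candidate invariants named in Observation 2.
inv_left = atr_lt | compose(atr_lt, atr_eq)
inv_right = atr_lt | compose(atr_eq, atr_lt)

tr_star = reachability(tr, states)
both_are_transition_invariants = tr_star <= inv_left and tr_star <= inv_right
safe = not any(s in init and t in bad for (s, t) in inv_left | inv_right)
```

The disjunctive shape is visible here: `atr_lt` handles the short paths, and the composition with `atr_eq` handles everything longer.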

Note that transition invariants obtained using Lemma 3 are 
disjunctive by definition. The disjunctive structure reflects the 
inductive nature of the proof of Lemma 3. $ATr^{<n}$ corresponds 
to the base case and represents the bounded part of the proof; 
$ATr^{=n}$ corresponds to the induction step and represents the 
unbounded part of the proof. Since the induction step makes 
$2^{n}$ steps of Tr instead of 1, the unbounded proof corresponds 
to k-induction rather than induction. 

The procedure HASTRANSITIONINVARIANT checks the 
conditions of Lemma 2 and Lemma 3 using an SMT 
solver. For example, 
$ATr^{=n} \circ ATr^{=n} \circ ATr^{<n} \triangleright Bad \subseteq ATr^{=n} \circ ATr^{<n} \triangleright Bad$ 
iff $ATr^{=n}(x, x') \land ATr^{=n}(x', x'') \land ATr^{<n}(x'', x''') \land Bad(x''') \land \neg ATr^{=n}(x, x'')$ 
is unsatisfiable. 
When the procedure discovers a grounded transition invariant, 
it must also verify that the invariant is safe, i.e., that it does not 
relate any initial state with any bad state. This can also be checked 
with a single satisfiability query. In the case of a transition 
invariant detected using the conditions of Lemma 2, the check is 
not even necessary: the invariant, which is $ATr^{<n}$ for some 
n, is guaranteed to be safe after ISREACHABLELT proved 
bounded safety on level n − 1. 

The detection of safe (grounded) transition invariants as 
described above allows SPLIT-TPA to prove safety and the 
correctness is guaranteed by Lemma 2 and Lemma 3. 

Theorem 2: If SPLIT-TPA returns SAFE, there is no coun- 
terexample path from Init to Bad in S. 

To demonstrate the behaviour of SPLIT-TPA, recall the 
loops from Fig. 1. For the first loop, similarly to TPA, 
SPLIT-TPA quickly proves bounded safety up to $8192 = 2^{13}$ 
iterations of the loop, and in the process learns that 
$ATr^{<n} \equiv x' < x + 2^{n}$ and that $ATr^{=n} \equiv x' \le x + 2^{n}$ for n = 1...13. 
In the next iteration of its main loop, SPLIT-TPA discovers 
an abstract path consisting of a step of $ATr^{<13}$ followed by 
a step of $ATr^{=13}$. After some time spent in the refinement, 
the algorithm manages to refute all abstract paths and proves 
bounded safety for $< 2^{14}$ iterations. As part of the refinement, 
it strengthens $ATr^{=13}$ to include the facts $x' = x + 8192$ 
and $x < 1808$. With this strengthened information, it can 
easily prove that no path of length exactly $2^{14} = 16384$ exists 
because it is not possible to make two steps of the abstract 
relation $ATr^{=13}$ from the initial state. In addition, it learns 
that $ATr^{=14} \equiv x < -6384$. This satisfies the condition of 
Observation 2, namely $ATr^{=14} \circ ATr^{=14} \subseteq ATr^{=14}$. Thus 
SPLIT-TPA concludes at this point that the system is safe. 

When analyzing the second loop, SPLIT-TPA behaves 
differently than TPA. After proving bounded safety in the first 
iteration of Algorithm 3, SPLIT-TPA learns that $ATr^{=1}$ 
xz >z = w > w+ 2. In the next iteration, ATr7 
is strengthened with facts 7 > z => w > w4+1 and 
z <z => w > w. These three facts together concisely 
over-approximate the change to w after precisely two iterations 
of the loop. Moreover, $ATr^{=1}$ with these three components is 
closed under composition, i.e., $ATr^{=1} \circ ATr^{=1} \subseteq ATr^{=1}$. 
Thus, SPLIT-TPA concludes already at this point that the sys- 
tem is safe (based on Observation 2). The transition invariant, 
using @ = (x, z, v, w), is then ATr<' (a, @’)V(ATr<' (ë, @)A 
ATr='(a,@”)), where 


= 


ATr<'(4,@) =w >wAv' <vA 
(a >an <z) V(t >r+1Az <2+2)), 


AT (ad) =r>z>w >wt2a 
r>z>w >w++1l^ 


r<z=>w >w. 


Note that the exact value of $ATr^{<1}$ is not important in this 
case, as long as it over-approximates all paths of length $< 2$. 


V. FROM TRANSITION INVARIANTS TO STATE INVARIANTS 


In Section IV-C, we have shown how SPLIT-TPA can prove 
a transition system safe by finding a safe transition invariant. 
However, many applications require a proof of safety in the 
form of a safe inductive (state) invariant. Here we show that 
(k-)inductive invariants can be obtained from the discovered 
transition invariants by quantifying over the source or target 
states. This follows Lemma 2 and Lemma 3 and their proofs. 

Lemma 4: Assume that for some n, the following holds: 

$$Init \triangleleft ATr^{<n} \circ Tr \subseteq Init \triangleleft ATr^{<n}.$$ 

Then the following is an inductive invariant: 

$$Inv(x') \equiv \exists x : Init(x) \land ATr^{<n}(x, x').$$ 

Proof: Analogous to the proof of Lemma 2. Intuitively, 
Inv represents all states reachable by one step of $ATr^{<n}$ from 
Init. Since $ATr^{<n}$ is a left-grounded transition invariant by 
Lemma 2, making one additional step of Tr cannot end up 
outside this set. Also, $Init \subseteq Inv$, because $Id \subseteq ATr^{<n}$, i.e., 
Inv holds in the initial states. ∎ 
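
The construction of Lemma 4 amounts to taking the image of Init under $ATr^{<n}$. The following explicit-state sketch illustrates it under assumptions: the toy counter system and the candidate relation `atr_lt` are hypothetical, and the existential quantifier is realized by enumerating pairs rather than by quantifier elimination.

```python
def compose(r, s):
    """Relational composition: a step of r followed by a step of s."""
    return {(a, c) for (a, b) in r for (b2, c) in s if b == b2}

# Hypothetical toy system: a counter that stops at 4.
states = set(range(8))
init, bad = {0}, {6}
tr = {(x, x + 1) for x in states if x < 4}
identity = {(x, x) for x in states}
atr_lt = identity | tr | {(0, y) for y in range(5)}   # stand-in for ATr^{<1}

# Premise of Lemma 4: Init <| ATr^{<n} o Tr  is contained in  Init <| ATr^{<n}.
premise = ({(a, b) for (a, b) in compose(atr_lt, tr) if a in init}
           <= {(a, b) for (a, b) in atr_lt if a in init})

# Inv(x') = exists x : Init(x) and ATr^{<n}(x, x')
inv = {t for (s, t) in atr_lt if s in init}

initiation = init <= inv                                      # Init implies Inv
consecution = all(t in inv for (s, t) in tr if s in inv)      # Inv is closed under Tr
safety = inv.isdisjoint(bad)                                  # Inv excludes Bad
```

For this toy system `inv` is the set {0, 1, 2, 3, 4}, which is inductive and excludes the bad state.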
Lemma 5: Assume that for some n, the following holds: 

$$Tr \circ ATr^{<n} \triangleright Bad \subseteq ATr^{<n} \triangleright Bad.$$ 

If $ATr^{<n}$ is safe, then the following is an inductive invariant: 

$$Inv(x) \equiv \neg(\exists x' : ATr^{<n}(x, x') \land Bad(x')).$$ 

Proof: Analogous to the proof of Lemma 4. ∎ 

Compared to Lemma 2, the proof of Lemma 3 uses an 
inductive step of size $2^{n}$. Following that proof, we can turn the 
transition invariant from $TPA^{=}$ into a $2^{n}$-inductive invariant. 

Lemma 6: Assume that for some n, the following holds: 

$$Init \triangleleft ATr^{<n} \circ ATr^{=n} \circ ATr^{=n} \subseteq Init \triangleleft ATr^{<n} \circ ATr^{=n}.$$ 

Then the following is a $2^{n}$-inductive invariant: 

$$Inv(x'') \equiv \exists x, x' : Init(x) \land \big(ATr^{<n}(x, x'') \lor (ATr^{<n}(x, x') \land ATr^{=n}(x', x''))\big).$$ 


Proof: We follow the proof of Lemma 3. Inv represents 
the set of states reachable from Init either by one step of 
$ATr^{<n}$ or by a combined step of $ATr^{<n}$ and $ATr^{=n}$. It 
follows that Inv over-approximates the set of states reachable 
from Init in less than $2^{n+1}$ steps of Tr. Thus, Inv satisfies 
the base step of k-induction (for $k = 2^{n}$). 

For the inductive step, we need to prove that making $2^{n}$ 
steps of Tr from an Inv-state leads again to an Inv-state. We 
rely on Eq. (5), i.e., $ATr^{=n} \supseteq Tr^{2^{n}}$. If s is an Inv-state, then 
it is reachable from some initial state i either in one step of 
$ATr^{<n}$ or in one step of $ATr^{<n} \circ ATr^{=n}$. Moreover, all states 
reachable from s in $2^{n}$ steps of Tr are reachable from s by 
one step of $ATr^{=n}$. Thus, in the first case, they are reachable 
from i in one step of $ATr^{<n} \circ ATr^{=n}$. In the second case, they 
are reachable from i in one step of $ATr^{<n} \circ ATr^{=n} \circ ATr^{=n}$. 
Based on the assumption of the lemma, they are then also reachable 
from i in one step of $ATr^{<n} \circ ATr^{=n}$. ∎ 

Lemma 7: Assume that for some n, the following holds: 

$$ATr^{=n} \circ ATr^{=n} \circ ATr^{<n} \triangleright Bad \subseteq ATr^{=n} \circ ATr^{<n} \triangleright Bad.$$ 

If $ATr^{<n}(x, x'') \lor (ATr^{=n}(x, x') \land ATr^{<n}(x', x''))$ is safe, 
then the following is a $2^{n}$-inductive invariant: 

$$Inv(x) \equiv \neg\big(\exists x', x'' : Bad(x'') \land (ATr^{<n}(x, x'') \lor (ATr^{=n}(x, x') \land ATr^{<n}(x', x'')))\big).$$ 

Proof: Analogous to the proof of Lemma 6. ∎ 
Note that in each given case, the (k-)inductive invariants 
are quantified and quantifier elimination must be applied 
if quantifier-free inductive invariants are required. Inductive 
invariants can be obtained from k-inductive invariants by 
quantifying over the intermediate states [29]. 
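
The $2^{n}$-inductive invariant of Lemma 6 can likewise be computed on an explicit toy system. This is a sketch under assumptions: the system, the stand-ins `atr_lt` and `atr_eq` for $ATr^{<1}$ and $ATr^{=1}$, and the helper `post` are hypothetical, and the two checks follow the base and inductive steps exactly as they are phrased in the proof of Lemma 6.

```python
def post(rel, src):
    """Image of the state set src under the relation rel."""
    return {t for (s, t) in rel if s in src}

# Hypothetical toy system: x grows by 2 while it stays below 16.
states = set(range(16))
init, bad = {0}, {7}
tr = {(x, x + 2) for x in states if x + 2 in states}
identity = {(x, x) for x in states}

n = 1
k = 2 ** n
atr_lt = identity | tr                                  # paths of length < 2
atr_eq = {(x, y) for x in states for y in states
          if y - x >= 4 and (y - x) % 2 == 0}           # over-approximates 2 steps

# Premise of Lemma 6, evaluated on images of Init.
one_round = post(atr_eq, post(atr_lt, init))
premise = post(atr_eq, one_round) <= one_round

# Inv(x'') = exists x, x' : Init(x) and
#            (ATr^{<n}(x, x'') or (ATr^{<n}(x, x') and ATr^{=n}(x', x'')))
inv = post(atr_lt, init) | one_round

# Base step: every state reachable in fewer than 2^(n+1) steps is in Inv.
frontier, base_ok = set(init), True
for _ in range(2 ** (n + 1)):
    base_ok = base_ok and frontier <= inv
    frontier = post(tr, frontier)

# Inductive step: k steps of Tr from an Inv-state lead to an Inv-state.
after_k = set(inv)
for _ in range(k):
    after_k = post(tr, after_k)
step_ok = after_k <= inv

safety = inv.isdisjoint(bad)
```

Here `inv` turns out to be the set of even numbers below 16, which excludes the bad state 7.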


VI. EXPERIMENTS 


We have implemented SPLIT-TPA in our Horn solver 
GOLEM.² In our experiments we used GOLEM 0.1.0, which 
uses OPENSMT 2.3.2 for SMT solving and interpolation. 

The goal of the experiments was to compare SPLIT-TPA 
to TPA [5], which is also available in GOLEM, and to 
state-of-the-art tools ELDARICA 2.0.8 [26], Z3-SPACER [30] 
implemented in Z3 4.8.17 [35], and GSPACER [22], a more 
recent version enriched with global guidance. All experiments 
were conducted on a machine with an AMD EPYC 7452 32-core 
processor and 8x32 GiB of memory. We used a timeout of 5 
minutes for every task.³ 

For the evaluation we used the set of benchmarks represent- 
ing multi-phase loops [39], which are known to be challenging 
for automated analysis techniques. We used both the safe 


² https://github.com/usi-verification-and-security/golem.git 
³ Full results at http://verify.inf.usi.ch/content/split-tpa-experiments, artifact 
at https://doi.org/10.5281/zenodo.6988735 
TABLE I 
SUMMARY OF THE EXPERIMENTS ON MULTI-PHASE BENCHMARKS. 

Benchmark suite    | SPLIT-TPA | TPA    | Z3-SPACER | GSPACER | ELDARICA 
multi-phase safe   | 19 (7)    | 12 (0) | 6 (0)     | 24 (3)  | 26 (4) 
multi-phase unsafe | 37 (3)    | 35 (2) | 20 (0)    | 17 (0)  | 17 (0) 

Solved (unique) instances out of 54 benchmarks. 

TABLE II 
FULL RESULTS ON SAFE (LEFT) AND UNSAFE BENCHMARKS (RIGHT) 

Ben. | SPLIT-TPA | TPA | Z3-SPACER | GSPACER | ELDARICA || Ben. | SPLIT-TPA | TPA | Z3-SPACER | GSPACER | ELDARICA 
01 | 26.28 | TO | TO | TO | TO || 01 | 14.53 | 10.12 | TO | TO | TO 
02 | TO | TO | 133.28 | <1 | TO || 02 | <1 | <1 | 1.25 | TO | TO 
03 | TO | TO | TO | TO | 1.33 || 03 | <1 | <1 | <1 | <1 | 1.16 
04 | TO | TO | TO | <1 | 3.70 || 04 | TO | TO | TO | TO | TO 
05 | <1 | <1 | <1 | <1 | 1.19 || 05 | <1 | <1 | <1 | <1 | 1.18 
06 | TO | TO | TO | TO | 3.95 || 06 | TO | TO | TO | TO | TO 
07 | TO | TO | TO | <1 | 1.32 || 07 | TO | TO | TO | TO | TO 
08 | TO | TO | TO | TO | TO || 08 | TO | TO | TO | TO | TO 
09 | TO | TO | TO | TO | TO || 09 | TO | TO | TO | TO | TO 
10 | TO | TO | TO | TO | TO || 10 | 20.40 | 233.78 | TO | TO | TO 
11 | TO | TO | TO | 5.68 | TO || 11 | 152.28 | TO | TO | TO | TO 
12 | TO | TO | TO | TO | 1.62 || 12 | TO | TO | TO | TO | TO 
13 | <1 | <1 | ERR | <1 | 1.16 || 13 | <1 | <1 | <1 | <1 | 1.13 
14 | 53.94 | TO | TO | TO | 118.78 || 14 | <1 | <1 | <1 | 8.91 | 89.78 
15 | TO | TO | TO | TO | TO || 15 | TO | TO | TO | TO | TO 
16 | TO | TO | TO | TO | TO || 16 | TO | TO | TO | TO | TO 
17 | <1 | 37.50 | TO | <1 | 7.53 || 17 | 14.84 | 15.81 | 181.59 | TO | TO 
18 | <1 | <1 | TO | <1 | 3.66 || 18 | <1 | <1 | <1 | <1 | 1.57 
19 | TO | TO | <1 | <1 | 1.22 || 19 | <1 | <1 | <1 | <1 | 20.74 
20 | TO | TO | TO | TO | TO || 20 | TO | TO | TO | TO | TO 
21 | <1 | 10.39 | TO | <1 | 15.45 || 21 | <1 | <1 | <1 | <1 | 10.63 
22 | TO | TO | TO | TO | TO || 22 | TO | TO | TO | TO | TO 
23 | <1 | <1 | ERR | <1 | 1.79 || 23 | <1 | <1 | <1 | <1 | 1.17 
24 | TO | TO | TO | TO | TO || 24 | <1 | TO | 96.64 | TO | TO 
25 | TO | 45.93 | TO | TO | 9.33 || 25 | <1 | <1 | <1 | <1 | 1.19 
26 | 2.60 | 1.55 | TO | <1 | TO || 26 | 2.01 | 1.46 | TO | TO | TO 
27 | TO | TO | TO | TO | TO || 27 | <1 | <1 | TO | TO | TO 
28 | <1 | TO | TO | TO | 1.61 || 28 | <1 | <1 | TO | TO | 162.43 
29 | 3.94 | TO | TO | 118.98 | 34.22 || 29 | <1 | <1 | 2.76 | 32.56 | 45.75 
30 | TO | TO | TO | TO | 20.48 || 30 | <1 | <1 | <1 | <1 | 10.22 
31 | TO | TO | TO | <1 | 1.60 || 31 | TO | TO | TO | TO | TO 
32 | TO | TO | TO | 11.49 | TO || 32 | <1 | <1 | <1 | <1 | [illegible] 
33 | TO | TO | TO | TO | TO || 33 | <1 | <1 | <1 | <1 | 1.21 
34 | TO | TO | TO | <1 | 5.86 || 34 | <1 | <1 | <1 | <1 | 1.15 
35 | TO | TO | TO | <1 | 1.80 || 35 | <1 | <1 | <1 | <1 | 1.20 
36 | <1 | <1 | TO | <1 | 1.92 || 36 | 16.68 | 14.45 | TO | TO | TO 
37 | <1 | <1 | <1 | <1 | 14.33 || 37 | <1 | <1 | <1 | <1 | 13.37 
38 | TO | <1 | TO | <1 | 1.36 || 38 | 262.18 | TO | TO | TO | TO 
39 | TO | TO | 67.41 | 58.73 | 2.48 || 39 | TO | TO | TO | ERR | TO 
40 | 109.05 | TO | TO | TO | ERR || 40 | <1 | <1 | <1 | 133.07 | ERR 
41 | TO | TO | TO | TO | TO || 41 | TO | 4.60 | TO | TO | TO 
42 | TO | TO | TO | <1 | 4.37 || 42 | 18.31 | 40.39 | TO | TO | TO 
43 | TO | TO | TO | 5.20 | TO || 43 | TO | TO | TO | TO | TO 
44 | TO | TO | TO | TO | TO || 44 | 34.18 | TO | TO | TO | TO 
45 | TO | TO | TO | TO | TO || 45 | TO | TO | TO | TO | TO 
46 | TO | 288.20 | 13.07 | <1 | 1.28 || 46 | TO | 239.05 | TO | TO | TO 
47 | TO | TO | TO | TO | TO || 47 | 5.71 | 6.79 | TO | TO | TO 
48 | 47.00 | TO | TO | TO | TO || 48 | 17.52 | 12.10 | TO | TO | TO 
49 | 122.96 | TO | TO | TO | TO || 49 | 32.59 | 12.49 | TO | TO | TO 
50 | TO | TO | TO | TO | TO || 50 | TO | TO | TO | TO | TO 
51 | TO | TO | TO | TO | TO || 51 | 6.71 | 11.57 | TO | TO | TO 
52 | 235.24 | TO | TO | TO | TO || 52 | 70.83 | 82.43 | TO | TO | TO 
53 | 147.28 | TO | TO | TO | TO || 53 | 57.42 | 33.00 | TO | TO | TO 
54 | 133.63 | TO | TO | TO | TO || 54 | 40.74 | 15.15 | TO | TO | TO 

TO: timeout; ERR: memory out or other inconclusive answer. 


versions of the benchmarks from the CHC-COMP repository⁴ and 
the unsafe versions of the benchmarks from [5]. The results 
are summarized in Table I and times for each tool/benchmark 
pair are given in Table II. 

Regarding safety, Table I shows that SPLIT-TPA overall 
solved 7 more benchmarks than TPA, but still fewer than 
GSPACER or ELDARICA. However, it solved seven benchmarks 
uniquely (no other competitor solved them). 
This indicates that SPLIT-TPA is quite orthogonal to the 
existing techniques for proving safety. 

The results on unsafe benchmarks show that SPLIT-TPA 
not only preserves the capability of TPA to detect deep 
counterexamples, but was even able to outperform it by 
solving two more benchmarks overall. 
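
The "solved (unique)" entries of Table I can be recomputed from the per-benchmark results of Table II. The sketch below shows the counting on four rows copied from the safe half of Table II; the data layout and function names are assumptions made for illustration, not part of the authors' tooling.

```python
TOOLS = ["SPLIT-TPA", "TPA", "Z3-SPACER", "GSPACER", "ELDARICA"]

# Four rows taken from the safe half of Table II (cells kept as printed).
safe_rows = {
    "01": ["26.28", "TO", "TO", "TO", "TO"],
    "05": ["<1", "<1", "<1", "<1", "1.19"],
    "17": ["<1", "37.50", "TO", "<1", "7.53"],
    "40": ["109.05", "TO", "TO", "TO", "ERR"],
}

def is_solved(cell):
    """A cell counts as solved unless it is a timeout or an error."""
    return cell not in ("TO", "ERR")

def summarize(rows):
    """Return per-tool counts of solved and uniquely solved benchmarks."""
    solved = {tool: 0 for tool in TOOLS}
    unique = {tool: 0 for tool in TOOLS}
    for cells in rows.values():
        winners = [tool for tool, cell in zip(TOOLS, cells) if is_solved(cell)]
        for tool in winners:
            solved[tool] += 1
        if len(winners) == 1:          # solved by exactly one tool
            unique[winners[0]] += 1
    return solved, unique

solved, unique = summarize(safe_rows)
```

On this four-row excerpt SPLIT-TPA solves all four benchmarks and two of them (01 and 40) uniquely; running the same function over all 54 rows would give the totals reported in Table I.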


⁴ https://github.com/chc-comp/aeval-benchmarks 


Besides the multi-phase benchmarks, we also evaluated the 
tools on a general benchmark set from the LRA-TS category of 
CHC-COMP 2021, the latest edition with a publicly available 
selected benchmark set.⁵ Out of 498 benchmarks, SPLIT-TPA 
proved 128 benchmarks safe and 72 unsafe. TPA proved 62 
benchmarks safe and 71 unsafe. Even though the performance 
of SPLIT-TPA still lags behind Z3-SPACER and GSPACER 
(ELDARICA does not support arithmetic over reals) on CHC-COMP 
benchmarks, it achieved a significant improvement 
over TPA, especially on safe benchmarks. 

To better understand the advantage of SPLIT-TPA over 
TPA, we collected statistics from the runs of SPLIT-TPA on 
safe instances to see which transition invariants it used to 
prove safety. In our implementation $TPA^{<}$ is checked before 
$TPA^{=}$. Moreover, each sequence element is first checked for 
a full transition invariant. This is followed by checks for a 
left-grounded and finally a right-grounded transition invariant. 

On the CHC-COMP 2021 LRA-TS benchmarks, out of 128 
benchmarks proven safe, 63 invariants were discovered from 
$TPA^{<}$ and 65 invariants were discovered with $TPA^{=}$. 
Regardless of the sequence, 81 were full transition invariants and 
47 were left-grounded transition invariants. Surprisingly, no 
(purely) right-grounded transition invariants were discovered. 
For safe multi-phase benchmarks the results were similar. Out 
of 19 invariants, 15 invariants were found with $TPA^{<}$ and 4 
invariants were found with $TPA^{=}$. Fifteen of these invariants 
were full transition invariants and 4 were left-grounded. Again, 
no purely right-grounded transition invariant was found. These 
statistics confirm the essential role of the $TPA^{=}$ sequence in 
SPLIT-TPA as a source of transition invariants. 


VII. RELATED WORK 


Many model-checking algorithms search for a safe inductive 
invariant to prove safety. Candidates for inductive invariants 
are typically obtained from proofs of bounded safety. The 
algorithms try to construct the safe inductive invariant either in a 
monolithic [32, 34, 38] or in an incremental way [8, 10, 17, 24, 30]. 
Our work follows a similar strategy, but it primarily computes 
transition invariants, not state invariants. 

Transition invariants have been introduced in [36] as a proof 
rule for program verification, especially termination and other 
liveness properties. Transition predicate abstraction [37] has 
been introduced as a way to compute transition invariants. 
In contrast, we use transition invariants to prove safety, with 
candidates automatically obtained from proofs of bounded 
safety using Craig interpolation. 

Craig interpolation [15] is a popular abstraction technique 
widely used in model checking. We use standard algorithms to 
compute interpolants from proofs of unsatisfiability [6, 13, 33]. 
The integration of domain-specific knowledge [31] is future 
work. 

While in most model checking algorithms interpolants are 
used as over-approximations of states, we use them to over- 
approximate transitions. The idea of abstracting transition 


⁵ https://github.com/chc-comp/chc-comp21-benchmarks/tree/main/LRA-TS 
relation with interpolants originates from [28]. However, they 
maintained an abstraction of only a single step of the transition 
relation. We build two sequences of relations 
over-approximating a doubling number of steps of the transition 
relation, which are useful both for detecting deep counterexamples 
and as a source of candidates for safe transition invariants. 
Loop acceleration [2, 7, 19] is a loop analysis technique that 
can prove safety and detect deep counterexamples. However, 
on its own, it is applicable only to limited types of integer 
loops. Acceleration has also been successfully integrated into 
interpolation-based model checking [9, 25] where interpolants 
computed from accelerated paths lead to much better ab- 
straction refinement in the traditional CEGAR algorithm [14]. 
In contrast, SPLIT-TPA computes transition interpolants, not 
state interpolants. It also does not try to capture all possible 
behaviour of a loop (by accelerating it). Instead, it builds over- 
approximations of (exponentially increasing) bounded number 
of iterations. By relying purely on Craig interpolation it can 
handle transition relations where acceleration is not possible. 
The k-induction principle [20] has been successfully used 
as a replacement for basic inductive reasoning in IC3-style 
algorithms [21, 23, 29]. k-inductive invariants can be more 
compact than inductive invariants and for some theories 
k-induction is a strictly stronger proof rule [29]. SPLIT- 
TPA uses both inductive reasoning (applied to $TPA^{<}$) and 
k-inductive reasoning (applied to $TPA^{=}$) to discover transition 
invariants. We believe that SPLIT-TPA’s success on challeng- 
ing systems can be in large part attributed to the inclusion of 
k-inductive reasoning, which was missing in TPA [5]. 


VIII. CONCLUSION 


In this work we have presented SPLIT-TPA, a novel in- 
stantiation of a recently introduced TPA approach. Splitting 
the transition power abstraction into two complementary parts 
makes the algorithm more efficient in proving safety by de- 
tecting safe transition invariants while still retaining and even 
improving the capability of detecting long counterexamples. 
The advantage of our instantiation has been confirmed 
experimentally on a set of challenging multi-phase benchmarks and 
on an extensive general benchmark set from CHC-COMP. The 
experiments also show that SPLIT-TPA is both competitive with 
and complementary to the state of the art in safety 
verification. As the next step, we plan to study extensions of 
SPLIT-TPA from transition systems to general CHC systems. 
Abstract—Avoiding collisions between obstacles and vehicles 
such as cars, robots, or aircraft is essential to the development 
of autonomy. To simplify the problem, many collision avoidance 
algorithms and proofs consider vehicles to be a point mass, 
though the actual vehicles are not points. In this paper, we 
consider a convex polygonal vehicle with nonzero area traveling 
along a 2-dimensional trajectory. We derive an easily-checkable, 
quantifier-free formula to check whether a given obstacle will 
collide with the vehicle moving on the planned trajectory. We 
apply our active corner method to two case studies of aircraft 
collision avoidance and benchmark its performance. 


I. INTRODUCTION 


Preventing collisions with obstacles or foreign objects is 
crucial when developing autonomous capabilities for robots, 
cars, aircraft, and many other vehicles. As such, collision 
avoidance remains a major research theme of the autonomy, 
robotics, and formal methods communities. In particular, for 
safety-critical tasks such as vehicles interacting with humans 
or animals, it is imperative to provide formal proofs that the 
vehicle will not collide with agents in its environment. 

In many papers studying trajectory planning or collision 
avoidance, e.g. [34], [3], [20], the vehicle is modeled as a 
point, and the volume — or surface area — occupied by 
the vehicle is ignored. In reality, land and air vehicles are 
not points but have a certain volume, and contact of any 
external object with any part of the vehicle would constitute 
a collision. In this paper, we present a novel, automated, 
and general technique to transform a planned trajectory of 
a vehicle with volume into explicit boundaries of the region 
in which an obstacle will not be at risk of a collision. This 
transformation provides an efficient, runtime-checkable test to 
determine whether a given obstacle will collide with a vehicle 
on the planned trajectory, even when the vehicle has volume. 

Given a part of a trajectory $\mathcal{T}$, a vehicle occupying the 
volume $v(x_T, y_T)$ when centered on position $(x_T, y_T)$ along 
the trajectory, and a point-obstacle $(x_O, y_O)$, the vehicle will 
not collide with the obstacle if and only if: 

$$\forall (x_T, y_T) \in \mathcal{T},\ (x_O, y_O) \notin v(x_T, y_T) \qquad (1)$$ 


In the rest of the paper, we will call this formulation the 
implicit formulation of collision avoidance. This implicit for- 
mulation is a correct definition, but it has one major drawback: 
because of the universal quantifier on $(x_T, y_T)$, it is not easy 
to check systematically or at runtime whether an obstacle is 
indeed at risk of a collision. Ideally, we would want to obtain 
a quantifier-free, easily checkable formula that is equivalent to 
(1); in the rest of this paper we will call that formula, which 
represents a region in the plane, the explicit formulation. In 
theory, one could use quantifier elimination, but for trajectories 
containing more than a few symbolic parameters, the algorithm 
does not finish in a reasonable time due to its doubly- 
exponential time complexity in the number of variables [14]. 

This issue arose before, notably in the verification of 
the Next-Generation Aircraft Collision Avoidance System 
ACAS X [22], [23]. In that work, the formal proof of cor- 
rectness was divided into: (i) establishing the trajectory of the 
aircraft from its equations of motion, leading to a formula of 
the form of (1); and (ii) establishing an equivalent quantifier- 
free formula that can be checked efficiently at runtime. Both 
tasks required a proof in the KeYmaera X theorem prover, with 
significant manual effort [35]. A similar approach was used 
in the verification of collision avoidance for ACAS Xu, the 
unmanned version of ACAS X, with horizontal maneuvers [1]. 
The object of this paper is to automate and generalize task (ii) 
of this process. 

In order to automate task (ii), we propose a different 
approach based on geometric intuition. Let us examine an ex- 
ample of a rectangular vehicle performing a simple maneuver 
(Figure 1). The central idea of the method presented in this 
paper is that the boundaries of the explicit formulation are 
either trajectories of a corner of the vehicle or sides of the 
vehicle at a few particular points. 

The corners to consider at every point depend on the slope 
of the trajectory: for a rectangular vehicle, the boundaries 
follow the top-right and bottom-left corners when the vehicle’s 
velocity is directed “northwest” (towards the top left) or 
southeast; and, symmetrically, the boundaries follow the top- 
left and bottom-right corners when the vehicle’s velocity is 
directed towards the northeast or southwest of the plane. We 
call these corners active corners. But this is not enough: at 
points where the trajectory switches from following one set 
of corners to another, the boundary may follow a side of 
the vehicle at that point, e.g., its bottom boundary at the 
lowest point of the trajectory on Figure 1. We call these points 
Fig. 1: Safe region for a rectangle with w = 2, h = 1 and 
its center following Equation (2). The trajectory is dashed 
in purple, safe region shaded in green, and unsafe region is 
unshaded. 


transition points. By capturing the motion of these boundaries 
— both corners depending on the slope or the sides of the 
vehicle at certain points — we can construct a quantifier- 
free formula equivalent to (1), corresponding to its equivalent 
explicit formulation. Our approach is fully symbolic with no 
approximation. 

In this paper, we formalize and generalize our approach 
to different trajectories and polygons and show how to find 
active corners and transitions symbolically and how to form 
a quantifier-free explicit formulation equivalent to the input 
implicit formulation. We carefully prove that our transfor- 
mation is both sound (any obstacle shown safe using the 
explicit formulation is safe using the implicit formulation) and 
complete (no obstacles that are actually safe appear unsafe 
using the explicit formulation). Finally, we detail a fully 
symbolic Python implementation of our work and present 
an evaluation of its performance on two applications from 
previous papers (where we fully automate results that had 
required significant manual proof effort) and a third non- 
polynomial example. 


II. OVERVIEW 


This section provides an overview of our approach, by 
walking through a simple example constructing a geometric 
safe region used to verify obstacle avoidance for aircraft. At 
present, our method applies only to two-dimensional planar 
motion due to the increased complexity of three-dimensional 
motion when analyzing trajectories and polyhedra. The ex- 
ample uses linear motion and a rectangle, though the method 
generalizes to other planar motion and convex polygons, as 
detailed in Section IV. 


A. Trajectory 


Consider the planar side-view of an airplane flying, initially 
descending at constant velocity and then ascending with con- 
stant velocity. For simplicity, assume the aircraft has infinite 
acceleration. In this example, we represent the bounds of the 


aircraft as an axis-aligned rectangle with width 2w and height 
2h that moves in the (x,y) plane. The airplane begins at the 
origin and moves in the plane with piecewise trajectory $\mathcal{T}$: 

$$\mathcal{T}: \quad \begin{cases} y = -2x, & x \in [0, 5] \\ y = x - 15, & x \in [5, \infty) \end{cases} \qquad (2)$$ 


B. Implicit Formulation 


Suppose the rectangle translates with its center moving 
along this piecewise trajectory. Additionally, assume there is a 
point obstacle at (xo, yo) to be avoided; that is, the rectangle 
never intersects the obstacle. Then we can state a quantified 
(or implicit) formulation of obstacle avoidance: 

$$\forall (x_T, y_T) \in \mathcal{T},\ \big(|x_O - x_T| > w \lor |y_O - y_T| > h\big) \qquad (3)$$ 


The implicit formulation of the safe region (3) straightfor- 
wardly represents a safety property of obstacle avoidance — 
if an object moving with its center fixed along the nominal 
trajectory $\mathcal{T}$ is far enough away (either width w in the x-axis 
or height h in the y-axis) from a point obstacle at $(x_O, y_O)$, 
then the object is safe. Here we use safe region to mean the 
set of all obstacle locations for which collision is avoided. 


C. Explicit Formulation 


The need for a quantifier-free equivalent of Equation (3) 
motivates an explicit formulation of the safe region for the 
obstacle. The goal of this work is to automate the generation 
of such a formulation. We compute the reachable set of the 
object as it moves along a trajectory in order to compute the 
complement of the safe region: the unsafe region. We can 
express the unsafe region (the set of all locations for which an 
obstacle will collide with the object as it moves along a given 
trajectory) directly as a union of regions in the plane, each 
defined (in this case) by an intersection of linear inequalities 
(Equation (4)) bounding the region, plotted in Figure 1. The 
safe region is simply the negation of (4). 


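$$\begin{aligned} 
&\big((x_O \ge -w) \land (y_O \le h) \land (y_O \ge -2x_O - 2w - h) \\ 
&\quad \land (x_O \le 5 + w) \land (y_O \le -2x_O + 2w + h) \land (y_O \ge -10 - h)\big) \\ 
&\lor \big((x_O \ge 5 - w) \land (y_O \ge -10 - h) \\ 
&\quad \land (y_O \le x_O + w + h - 15) \land (y_O \ge x_O - w - h - 15)\big) 
\end{aligned} \qquad (4)$$ 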


Note that the first disjunct in Equation (4) corresponds to the 
motion of the aircraft on the left side of Figure 1 as it descends; 
the second corresponds to the right side as the aircraft ascends. 
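
The contrast between the implicit and explicit formulations can be made concrete for the example trajectory (y = −2x on [0, 5], then y = x − 15). The sketch below is an illustration under assumptions, not the authors' implementation: `unsafe_explicit` evaluates the negation of the safe region, i.e., the quantifier-free formula (4), while `unsafe_sampled` only approximates the negation of the quantified formula (3) by sampling centre positions up to a hypothetical cut-off `x_max`.

```python
def unsafe_explicit(xo, yo, w, h):
    """Quantifier-free test of Equation (4): is (xo, yo) in the unsafe region?"""
    descent = (xo >= -w and yo <= h and yo >= -2 * xo - 2 * w - h
               and xo <= 5 + w and yo <= -2 * xo + 2 * w + h
               and yo >= -10 - h)
    ascent = (xo >= 5 - w and yo >= -10 - h
              and yo <= xo + w + h - 15 and yo >= xo - w - h - 15)
    return descent or ascent

def unsafe_sampled(xo, yo, w, h, x_max=40.0, steps=4001):
    """Approximate the negation of Equation (3) by sampling the trajectory.

    Only centres with x in [0, x_max] are tried, so this is a sanity check
    for obstacles well inside that range, not a decision procedure.
    """
    for k in range(steps):
        x = x_max * k / (steps - 1)
        y = -2 * x if x <= 5 else x - 15
        if abs(xo - x) <= w and abs(yo - y) <= h:
            return True          # some rectangle position covers the obstacle
    return False

# Rectangle from Fig. 1: w = 2, h = 1.
on_path = unsafe_explicit(20, 5, 2, 1)       # obstacle lies on the ascent
far_above = unsafe_explicit(0, 5, 2, 1)      # obstacle well above the start
```

The explicit test needs a constant number of comparisons per obstacle, whereas the sampled test needs one pass over the trajectory and can still miss collisions between samples.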


III. ALGORITHM 
A. Preliminaries 


In this work, we define the safe region as the set of obstacle 
locations where, given a polygon's trajectory, a collision will 
not occur. Correspondingly, the unsafe region is the set of 
locations at which an obstacle would collide with the polygon. 
As such, the unsafe region 
corresponds to the reachable set of the polygon as it moves 
along a trajectory. We can define a quantified representation 
of the safe region: the implicit formulation from (1). We also 
use the term explicit formulation in this work; we use that to 
mean an equivalent to the implicit formulation, but without 
quantifiers like ∀ and ∃. Our method primarily applies to 
convex polygons. In this paper, we discuss polygons with 
central symmetry for ease of exposition, though the method 
straightforwardly extends to irregular and asymmetric convex 
polygons, and can be extended to concave polygons 
(Section III-F). 

We consider two-dimensional planar trajectories defined 
piecewise, with each piece a function y = f(x) or x = f(y) 
and f a $C^{1}$ function (differentiable and having a continuous 
derivative). Trajectories must have a finite number of these $C^{1}$ 
pieces. The trajectory need not be continuous across pieces, though 
the applications we study do include continuous piecewise 
trajectories. The subdomains for the piecewise trajectory must be 
non-overlapping and exhaustive, meaning their union should 
cover the entire domain of the trajectory. Polygons move along 
the trajectory without rotating. Since the polygons translate 
along the trajectory, there is a constant vector offset 
$(\Delta x_i, \Delta y_i)^{\top}$ 
from the center to the i-th vertex, and there are n vertices 
for an n-sided polygon. Thus, the trajectory for vertex i is 
$y - \Delta y_i = f(x - \Delta x_i)$ or $x - \Delta x_i = f(y - \Delta y_i)$. We consider 
the trajectories of all vertices of the polygon in an attempt to 
bound its motion and compute the reachable set of the object 
as it moves along the trajectory. 


B. Active Corners 


Throughout this section we consider a trajectory of the
form y = f(x); the case for x = f(y) is symmetric. The
boundaries of the safe region are (for centrally symmetric 
polygons) formed by the trajectories of a pair of opposite 
corners of the vehicle (Figure 2) — we call this pair of corners 
active corners. For asymmetric polygons, the corners may not 
directly oppose each other. 

We choose the active corners to represent the outermost
extent of the object along the trajectory; as such, their motion
bounds the safe region. Which corners are active depends on
the slope of the trajectory (which can be computed from the
derivative of f) and the shape of the convex object. A corner
v_i is active when the slope θ of the trajectory (expressed as
an angle) is between the angles of the sides adjacent to v_i;
when a corner is active, its opposite corner is also active based
on the symmetry of the polygon. More precisely, if we number
the corners v_1 through v_n counter-clockwise (with
v_{n+1} = v_1 and v_0 = v_n), corner v_i is active if and only if
the slope θ of the trajectory is in the angle interval
[∠v_{i−1}v_i, ∠v_i v_{i+1}], or symmetrically in the same
interval shifted by 180°. Because the direction of the
trajectory is inconsequential for our purpose, θ is taken
modulo 180°.

For example, on the hexagon in Figure 2, v_1 and v_4 are
active when θ ∈ [0°, 60°] ∪ [180°, 240°]; v_2 and v_5 are active
when θ ∈ [60°, 120°] ∪ [240°, 300°]; and v_3 and v_6 are active
when θ ∈ [120°, 180°] ∪ [300°, 360°].
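
The selection rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes the polygon is convex and given as vertex offsets from its center, and it picks the active corners as the two vertices that are extreme along the normal to the direction of motion, which for a convex polygon coincides with the angle-interval rule. The function name `active_corners` and the placement of v_1 through v_6 in `HEXAGON` are hypothetical, chosen to be consistent with the intervals listed for the hexagon.

```python
import math

def active_corners(vertices, theta_deg):
    """Return the indices of the two active corners for slope angle theta.

    vertices: (dx, dy) offsets of a convex polygon's corners from its center.
    The active corners are the extremes of the projection onto the normal
    of the direction of motion (an assumption-level reformulation of the
    angle-interval rule in the text).
    """
    theta = math.radians(theta_deg)
    normal = (-math.sin(theta), math.cos(theta))
    proj = [vx * normal[0] + vy * normal[1] for vx, vy in vertices]
    lowest = min(range(len(vertices)), key=lambda k: proj[k])
    highest = max(range(len(vertices)), key=lambda k: proj[k])
    return tuple(sorted((lowest, highest)))

# Hypothetical regular hexagon, v1..v6 stored at indices 0..5,
# numbered counter-clockwise starting from the lower-right corner.
HEXAGON = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
           for a in (300, 0, 60, 120, 180, 240)]

print(active_corners(HEXAGON, 30))   # indices of v1 and v4
print(active_corners(HEXAGON, 90))   # indices of v2 and v5
```

Exactly at an interval boundary (for example θ = 60°) two adjacent corners tie, and floating-point rounding decides which one is returned; either choice is acceptable per the discussion of trajectories parallel to a side.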


Fig. 2: A hexagon, the angles of its sides, and shifted active
corner-trajectories


At transition points (where the active corners change), the 
boundary of the safe region may not follow an active corner. 
We detail what happens then in Section III-C. Note that when
a linear trajectory is parallel to a side of the polygon (e.g.
θ = 60° for the hexagon in Figure 2), two adjacent corners
may both be active and either can be classified as such. For 
such trajectories, the active corners technically do not change 
for the length of the linear path, so there would be no transition 
points as long as the trajectory parallels a polygon edge. 


Given an obstacle at point (x_O, y_O), we can check if it is
inside the unsafe region (or reachable set) in a computationally
efficient fashion. If an obstacle lies outside the unsafe region, it
would be either above both corner-trajectories or below both
corner-trajectories for whichever corners are active. We can
express the location of the obstacle with respect to a
corner-trajectory in a single expression by considering the
value of y_O − f(x_O − Δx_i) − Δy_i for active corner (vertex)
v_i. This term will be positive for both vertices v_i, v_j if the
obstacle is above both corner-trajectories, and similarly
negative for both if the obstacle is below both. Therefore, any
point (x_O, y_O) in the safe region has a positive value for the
product of the two expressions above, and any point in the
unsafe region has a negative or zero value for this product.
This yields the following test to check if an object lies in
(part of) the unsafe region:

\[
\big(y_O - f(x_O - \Delta x_i) - \Delta y_i\big)\,\big(y_O - f(x_O - \Delta x_j) - \Delta y_j\big) \le 0
\tag{5}
\]

where Δx_i, Δy_i, Δx_j, Δy_j are the (constant) offsets from the
center of the polygon to the active corners (vertices) v_i, v_j.

When implementing this algorithm, the trajectories of all 
other vertices lie within the trajectories of the active corners, 
so to check whether an obstacle lies in the portion of the 
unsafe region defined by the active corners, it suffices to check 
over all pairs of vertices (v_i, v_j) with i, j ∈ {1, 2, ..., n}. This
check can be made more efficient by considering only all 
possible pairs of active corners based on the polygon shape 
and discarding, say, pairs of adjacent vertices. If the test in (5) 
indicates that an object lies within the unsafe region, trajectory 
y = f(x) or x = f(y) is clearly unsafe. 
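
As a hedged illustration of the test in (5), the sketch below evaluates the product for one pair of corners. The function name `unsafe_between_corners` and the example rectangle and trajectory are assumptions made for illustration; they are not taken from the authors' code.

```python
def unsafe_between_corners(f, obstacle, corner_i, corner_j):
    """Evaluate test (5) for one candidate pair of active corners.

    f: trajectory of the polygon center, y = f(x).
    obstacle: (x_O, y_O).
    corner_i, corner_j: (dx, dy) offsets from the center to each corner.
    Returns True when the obstacle lies between (or on) the two
    corner-trajectories.
    """
    x_o, y_o = obstacle
    dx_i, dy_i = corner_i
    dx_j, dy_j = corner_j
    side_i = y_o - f(x_o - dx_i) - dy_i
    side_j = y_o - f(x_o - dx_j) - dy_j
    return side_i * side_j <= 0

# Hypothetical example: a rectangle of half-width 2 and half-height 1
# moving along y = 0.5 x. For a rising trajectory the active corners
# are the top-left and bottom-right corners.
line = lambda x: 0.5 * x
top_left, bottom_right = (-2.0, 1.0), (2.0, -1.0)

print(unsafe_between_corners(line, (0.0, 0.0), top_left, bottom_right))
print(unsafe_between_corners(line, (0.0, 3.0), top_left, bottom_right))
```

In this example the corner-trajectories are y = 0.5x + 2 and y = 0.5x − 2, so the first obstacle lies between them and the second lies above both.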

Fig. 3: A rectangular airplane moving along a planar trajectory.
At the transition point at the parabola’s vertex, the “notch” is
visible and shaded in red; part of the object lies outside the
corner-trajectories at this point.

Fig. 4: For piecewise functions, between transition points
and/or piecewise boundaries, this figure shows the difference
between f(·) and g(·) and the two additional checks on
(x_O, y_O) described in Section III-D.


C. Notches at Transition Points Between Active Corners 


It turns out that using only active corners would yield an 
underestimate of the reachable set, which would be unsound 
for verifying safety. Figure 3 illustrates why: the white area 
bounded by the trajectories of the corners does not contain 
the red “notch,” even though a collision would occur with an 
obstacle in this notch. Therefore, if the test in (5) yields a value 
> 0, the trajectory is not necessarily safe; we additionally 
check safety at all transition points (x_T, y_T) to see whether
the obstacle at (x_O, y_O) lies within the polygon centered at
(x_T, y_T). Recall that transition points are defined as points
on the trajectory where the active corners switch. In the 
full test for safety (Equation 7), this check is represented as 
in_polygon() and can leverage one of many point-in-polygon 
implementations, which generally run in linear time on the 
order of number of vertices. As the slope of the function may 
change at the boundary between piecewise subfunctions, we 
also add a notch check at each subdomain boundary. 
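
A notch check of this kind can be sketched as a standard convex point-in-polygon test. The version below is an assumption-level illustration: it takes an extra `offsets` argument (the corner offsets, listed counter-clockwise) that the in_polygon() term in the text does not show, and it only handles convex polygons.

```python
def in_polygon(x_t, y_t, x_o, y_o, offsets):
    """Check whether (x_o, y_o) lies in a convex polygon centered at (x_t, y_t).

    offsets: corner offsets (dx, dy) from the center, in counter-clockwise
    order. A point is inside (or on the boundary) when it is on the left of,
    or on, every directed edge.
    """
    n = len(offsets)
    for k in range(n):
        ax, ay = x_t + offsets[k][0], y_t + offsets[k][1]
        bx, by = x_t + offsets[(k + 1) % n][0], y_t + offsets[(k + 1) % n][1]
        cross = (bx - ax) * (y_o - ay) - (by - ay) * (x_o - ax)
        if cross < 0:
            return False
    return True

# Hypothetical rectangle with half-width 2 and half-height 1,
# placed at a transition point (0, 0).
RECT = [(2.0, -1.0), (2.0, 1.0), (-2.0, 1.0), (-2.0, -1.0)]

print(in_polygon(0.0, 0.0, 0.0, 0.9, RECT))  # obstacle inside the notch polygon
print(in_polygon(0.0, 0.0, 0.0, 1.5, RECT))  # obstacle above the polygon
```

The loop visits each edge once, which matches the linear-time behavior in the number of vertices mentioned above.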


D. Handling Piecewise Functions 


In order to account for piecewise functions, we modify our
method in two ways to avoid using a subfunction outside
the subdomain over which it holds. The first is a modification
to hold subfunctions constant outside of the subdomain
over which they are defined; and the second is an additional
boolean clause in the safety test in (5) so it only applies
over a valid subdomain. In this case, the subdomain is an
interval [x_Tk, x_T(k+1)], where x_Tk, x_T(k+1) may be piecewise
boundaries or transition points. Because of this, a single
piecewise case that contains many transition points is split
into many subdomains.

Fig. 5: Terms in Equation (7), illustrated. Green terms restrict
the test to relevant piecewise subdomains; blue,
diagonally-hatched terms check if obstacles are between active
corner pairs; and orange, horizontally-hatched terms check if
obstacles are in the notch at transition points and subdomain
bounds.

First, we define a function g(x) (or g(y) symmetrically)
that holds constant the value of each subfunction outside of
its subdomain [x_Tk, x_T(k+1)]. The function g(·) is used in
place of f(·) in (5) above. Let y_Tk = f(x_Tk) and
y_T(k+1) = f(x_T(k+1)).

\[
g(x) =
\begin{cases}
y_{Tk} & \text{if } x < x_{Tk} \\
f(x) & \text{if } x_{Tk} \le x \le x_{T(k+1)} \\
y_{T(k+1)} & \text{if } x > x_{T(k+1)}
\end{cases}
\tag{6}
\]


Additionally, we add a clause to ensure the modified
subfunction g(·) is only used over the correct subdomain. First,
we check x_Tk − w ≤ x_O ≤ x_T(k+1) + w, where w is the
half-width of the object. We also construct a line between the
two active corners of the object at each of the two boundary
locations (x_Tk, y_Tk) and (x_T(k+1), y_T(k+1)) and check that
(x_O, y_O) is between the two lines. This way, we ensure
the test for being unsafe holds only for the region on which
each subfunction applies. Figure 4 illustrates the function g(·)
and the additional subdomain-related clauses.
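
The two modifications can be sketched as follows. This is a minimal illustration under stated assumptions: `make_g`, `in_x_range`, and `between_lines` are hypothetical helper names, and the "between" check relies on the two lines being parallel, which holds here because both connect the same pair of corner offsets at two different centers.

```python
def make_g(f, x_lo, x_hi):
    """Build g from (6): f on [x_lo, x_hi], held constant outside."""
    y_lo, y_hi = f(x_lo), f(x_hi)

    def g(x):
        if x < x_lo:
            return y_lo
        if x > x_hi:
            return y_hi
        return f(x)

    return g

def in_x_range(x_o, x_lo, x_hi, w):
    """First subdomain clause: x_lo - w <= x_O <= x_hi + w."""
    return x_lo - w <= x_o <= x_hi + w

def _side(p, a, b):
    """Signed area test: positive when p is to the left of the line a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def between_lines(p, start_center, end_center, corner_i, corner_j):
    """Second subdomain clause: p lies between the corner-to-corner lines
    drawn at the start and at the end of the subdomain."""
    a1 = (start_center[0] + corner_i[0], start_center[1] + corner_i[1])
    b1 = (start_center[0] + corner_j[0], start_center[1] + corner_j[1])
    a2 = (end_center[0] + corner_i[0], end_center[1] + corner_i[1])
    b2 = (end_center[0] + corner_j[0], end_center[1] + corner_j[1])
    return _side(p, a1, b1) * _side(p, a2, b2) <= 0

# Hypothetical subdomain [0, 2] of the parabola y = x^2.
g = make_g(lambda x: x * x, 0.0, 2.0)
print(g(-1.0), g(1.0), g(3.0))
```

Holding g constant outside the subdomain keeps the product in (5) well defined everywhere, while the two clauses restrict where its result is actually used.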


E. Generic Explicit Formulation 


This leads to a generic quantifier-free explicit formulation
to test whether an obstacle is in the safe region, where
{(x_T, y_T)_i} represents the set of all transition points and
boundary points between piecewise subdomains on the
trajectory. g(·) is used as defined previously in Section III-D.
As defined previously, Δx_i, Δy_i, Δx_j, Δy_j are the (constant)
offsets from the center of the polygon to active corners v_i and
v_j. Our algorithm generates a test for whether an obstacle is
unsafe (if a collision will occur); negating the boolean formula
or its result allows testing whether an obstacle lies in the safe
region.

\[
\begin{aligned}
\text{unsafe?} \equiv{} & \bigvee_{\{(k,\, x_{Tk},\, x_{T(k+1)})\}} \bigg( x_{Tk} - w \le x_O \le x_{T(k+1)} + w \;\wedge\; \bigvee_{\{(v_i, v_j)\}} \Big[ \\
& \quad \big(y_O - g(x_O - \Delta x_i) - \Delta y_i\big)\,\big(y_O - g(x_O - \Delta x_j) - \Delta y_j\big) \le 0 \\
& \quad \wedge\; (x_O, y_O)\ \text{between}\ \big(\mathrm{Line}((x_{Tk} + \Delta x_i,\, y_{Tk} + \Delta y_i),\, (x_{Tk} + \Delta x_j,\, y_{Tk} + \Delta y_j)), \\
& \qquad \mathrm{Line}((x_{T(k+1)} + \Delta x_i,\, y_{T(k+1)} + \Delta y_i),\, (x_{T(k+1)} + \Delta x_j,\, y_{T(k+1)} + \Delta y_j))\big) \Big] \bigg) \\
& \vee \bigvee_{\{(x_T, y_T)_i\}} \mathrm{in\_polygon}(x_{Ti},\, y_{Ti},\, x_O,\, y_O)
\end{aligned}
\tag{7}
\]

Equation (7) is color-coded in correspondence with Figure 5.
The first, third, and fourth lines ensure the test applies
only over the correct piecewise domain and are in green; the
second line checks the obstacle is between the active corners
and is in blue; the fifth line is in orange and checks for the
notch at transition points and piecewise subdomain boundaries.


F. Extensions 


Thus far, we have considered point-mass obstacles, but the 
reasoning extends to obstacles that have the same properties as 
the object (convex and centrally symmetric). This is achieved 
through a reduction where the shape of the obstacle is incorpo- 
rated into the shape of the object. For example, in the simple 
case where both are horizontal rectangles, with the object of 
height 2h and width 2w, and the obstacle of height 2h_O and
width 2w_O, the object and obstacle intersect if and only if
the center of the obstacle is contained in a virtual object with
the same center as the initial object, but of height 2(h + h_O)
and width 2(w + w_O). We have thus reduced the problem
of collision avoidance with a convex obstacle to a problem of
avoidance with a point-mass obstacle. A similar reasoning,
albeit a little more complicated, can be applied to any
convex, centrally symmetric obstacle.
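
The rectangle reduction can be checked with a short sketch. The helper names `virtual_object`, `collides`, and `rectangles_overlap` are hypothetical; the second function is a direct interval-overlap test included only to compare against the reduced point-obstacle test.

```python
def virtual_object(h, w, h_o, w_o):
    """Half-height and half-width of the enlarged (virtual) object."""
    return h + h_o, w + w_o

def collides(center, h, w, obstacle_center, h_o, w_o):
    """Reduced test: is the obstacle's center inside the virtual object?"""
    big_h, big_w = virtual_object(h, w, h_o, w_o)
    dx = obstacle_center[0] - center[0]
    dy = obstacle_center[1] - center[1]
    return abs(dx) <= big_w and abs(dy) <= big_h

def rectangles_overlap(center, h, w, obstacle_center, h_o, w_o):
    """Direct test: do the two axis-aligned rectangles share a point?"""
    x_ok = (max(center[0] - w, obstacle_center[0] - w_o)
            <= min(center[0] + w, obstacle_center[0] + w_o))
    y_ok = (max(center[1] - h, obstacle_center[1] - h_o)
            <= min(center[1] + h, obstacle_center[1] + h_o))
    return x_ok and y_ok

# Compare both tests on a small integer grid of obstacle positions.
agree = all(
    collides((0, 0), 1, 2, (ox, oy), 1, 1)
    == rectangles_overlap((0, 0), 1, 2, (ox, oy), 1, 1)
    for ox in range(-5, 6) for oy in range(-5, 6)
)
print(agree)
```

For general convex, centrally symmetric shapes the virtual object would be the Minkowski sum of the two shapes; this sketch covers only the axis-aligned rectangle case described in the text.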

For ease of presentation, and because they appear in most 
practical applications, we have focused on objects that are 
convex and centrally symmetric. We can extend the reasoning 
to non-centrally symmetric objects: the only difference in that 
case is that pairs of active corners do not change together, but 
rather one active corner may change on one side, and another 
active corner may change on the other side later. Pairs of active 
corners are thus not opposite corners of the object anymore. 
The convexity of the object (and obstacles) is essential for 
active corners; however, we can extend our reasoning to non- 
convex, polygonal objects by seeing them as unions of convex 
sub-objects and ensuring collision avoidance with each
sub-object. Finally, due to its reliance on corners, our method
cannot handle circles or ellipses, but they can be approximated 
by polygons. 


IV. PROOF OF EQUIVALENCE 


We prove the equivalence of the safe regions represented by
1) the implicit formulation and 2) the explicit output of our
active-corner method for trajectories of the form y = f(x). The
proof of soundness follows; the proof of completeness is in our
full paper on arXiv. The proof structure considers segments
of the trajectory in which no active corner switch occurs; that
is, where the angle of the tangent to the trajectory is bounded.
In these segments, the bounds on the trajectory tangent angle
allow us to bound the location of points in the interior of the
polygon and show they lie between the two active corners. The
two endpoints of a segment represent locations at which 1) the
notch exists or 2) the trajectory switches to a new piecewise
subfunction. In our method, these cases are handled by testing
if obstacle (x_O, y_O) is inside the polygon at the transition
points {(x_T, y_T)_i}.



A. Proof Preliminaries 


Consider a segment of the motion along trajectory y = f(x) 
or x = f(y) in which no active corner switches or piecewise 
trajectory segment switches occur. We can arbitrarily rotate 
this segment of motion and the proof will hold, since the 
object translates along the trajectory without rotation. Assume, 
then, that a rotation is made by an angle θ such that the
active corners are oriented along a vertical line. This rotation 
is an invertible transformation, so the logic of this proof 
holds through the entire trajectory. Because of this coordinate 
rotation, we consider only trajectories y = f(x) for the proof; 
any trajectory x = f(y) can be rotated into the form y = f(x) 
invertibly, so our results hold for these forms as well. Let v_i, v_j
denote the active corners for this segment, with corresponding
offsets Δx_i, Δy_i, Δx_j, Δy_j in the rotated coordinate system.

Since no active corner switch occurs, we know the
slope of function y = f(x) is limited by the shape of the
polygon itself; let these bounds be ±m, with m representing
the slope of the relevant sides of the polygon. Slopes of f
beyond this range cannot occur over the trajectory segment
in consideration, due to our assumption that no active
corner switches occur. Because the polygon is symmetric, the
lower bound on slope is the negative of the upper bound
(illustrated further in Figure 7). This proof is presented
considering regular, symmetric polygons for simplicity, but
extends to asymmetric polygons as discussed in our full paper
on arXiv. To prove the soundness of our method, we must prove
that all obstacles shown safe using our method (safe_expl) are
also safe using the input implicit formulation (safe_impl). To
prove safe_expl ⟹ safe_impl, we prove the contrapositive
unsafe_impl ⟹ unsafe_expl.

Specifically, unsafe_impl means that an obstacle at (x_O, y_O)
is inside a polygon centered at some coordinates (x, y);
unsafe_expl means that the test below, from (5), holds:

\[
\big(y_O - f(x_O - \Delta x_i) - \Delta y_i\big) \cdot \big(y_O - f(x_O - \Delta x_j) - \Delta y_j\big) \le 0
\]

Fig. 6: Sections of proof

Fig. 7: Figure of the slope of the sides of a regular hexagon
and octagon.


This proof has three sections: one holds for the majority of 
the trajectory segment, one for the beginning of the segment, 
and one for the end of the segment. The beginning- and
end-of-segment proofs follow the form of the main proof but consider
polygons fixed at the trajectory segment endpoints. They are 
included in our full paper available on arXiv. 


B. Middle Segment Proof 


Consider a (symmetric) polygon P, with half-height h and
half-width w, centered at (x_C, y_C), where y_C = f(x_C). Let
(x_int, y_int) be a point inside or on the edges of P. We prove
that interior point (x_int, y_int) lies between the active corners
of an identical polygon P̄ located at (x_int, f(x_int)). We do this
by bounding three terms: 1) f(x_int), 2) y_int, and 3) the active
corners of P̄, to prove that y_int lies between them.

First, we bound f(x_int) (the center of P̄). Let x_int = x_C + Δx,
for Δx ∈ [−w, w]. Let f(x_int) = f(x_C + Δx) = y_C + Δy,
for some Δy which we will bound. The slope of the trajectory
dy/dx is bounded by (−m, m), because this proof considers a
segment of motion with no changes in active corner. Hence Δy is
bounded proportionally to Δx, with Δy ∈ (−|mΔx|, |mΔx|).
Therefore, f(x_int) ∈ (y_C − |mΔx|, y_C + |mΔx|). Our proof
proceeds assuming Δx ≠ 0, since if Δx = 0, x_int will lie on
the vertical centerline of P. In that case, it is trivial to show
that (x_int, y_int) lies between the active corners.

Recall x_int = x_C + Δx, for Δx ∈ [−w, w]. Let y_int =
y_C + Δy_int, for some Δy_int which we will bound. Given that
the slopes of the sides on the top and bottom of P are ±m,
we assert that any (x_int, y_int) with x_int = x_C + Δx has a
corresponding Δy_int ∈ [−h + |mΔx|, h − |mΔx|]. This is
illustrated in Figure 7 with a hexagon, but it generalizes to
any symmetric convex polygon. Given this, we can bound the
x and y interior coordinates as below:

\[
x_{\mathrm{int}} = x_C + \Delta x, \qquad
y_{\mathrm{int}} \in \big[\, y_C - h + |m\Delta x|,\ y_C + h - |m\Delta x| \,\big]
\tag{8}
\]

Fig. 8: Shifted polygon illustration


Finally, we show interior point y-coordinate y_int lies within
the active corners of P̄. Because we consider a rotated
coordinate frame such that the active corners are oriented
along the vertical axis, the top and bottom active corners are
located at (x̄_top, ȳ_top) = (x_int, f(x_int) + h) and
(x̄_bot, ȳ_bot) = (x_int, f(x_int) − h), respectively. The bounds
on ȳ_top and ȳ_bot are given by the following:

\[
\begin{aligned}
y_C - |m\Delta x| + h &< \bar{y}_{\mathrm{top}} < y_C + |m\Delta x| + h \\
y_C - |m\Delta x| - h &< \bar{y}_{\mathrm{bot}} < y_C + |m\Delta x| - h
\end{aligned}
\tag{9}
\]

Then y_int ≤ y_C − |mΔx| + h < ȳ_top and
y_int ≥ y_C + |mΔx| − h > ȳ_bot.
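The top active corner trajectory is given by f_top(x) = f(x) + h
and the bottom active corner trajectory is given by
f_bot(x) = f(x) − h. By definition, f_top(x_int) = ȳ_top and
f_bot(x_int) = ȳ_bot, or equivalently, ȳ_top − f_top(x_int) = 0
and ȳ_bot − f_bot(x_int) = 0. Since y_int < ȳ_top and
y_int > ȳ_bot,

\[
y_{\mathrm{int}} - f_{\mathrm{top}}(x_{\mathrm{int}}) < 0, \qquad
y_{\mathrm{int}} - f_{\mathrm{bot}}(x_{\mathrm{int}}) > 0
\tag{10}
\]

By multiplying the inequalities in (10), we get

\[
\big(y_{\mathrm{int}} - f_{\mathrm{top}}(x_{\mathrm{int}})\big) \cdot \big(y_{\mathrm{int}} - f_{\mathrm{bot}}(x_{\mathrm{int}})\big) < 0
\tag{11}
\]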


This is an equivalent test for whether an object lies in the
unsafe region from (5). Therefore we have shown that for all
(x_C, y_C) points satisfying y_C = f(x_C), all points (x_int, y_int)
inside and on the boundary of a polygon centered at (x_C, y_C)
also have (f_top(x_int) − y_int) · (f_bot(x_int) − y_int) < 0.
These are exactly the definitions of unsafe_impl and unsafe_expl
from IV-A earlier. Therefore, we have shown
unsafe_impl ⟹ unsafe_expl, and the contrapositive
safe_expl ⟹ safe_impl holds as well.
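
The middle-segment argument can be sanity-checked numerically. The sketch below is not part of the proof and is only an illustration under assumptions: it models the polygon in the rotated frame as the region |Δx| ≤ w, |Δy| ≤ h − m|Δx| (a hexagon-like shape with the active corners on the vertical axis), samples interior points, and checks the non-strict form of (11). The function name `check_middle_segment` and the sample trajectories are hypothetical.

```python
import math

def check_middle_segment(f, m, h, w, x_centers, steps=20):
    """Sample interior points and check (y - f_top(x)) * (y - f_bot(x)) <= 0.

    f: trajectory y = f(x); m: bound on the side slopes;
    h, w: half-height and half-width (requires h - m * w >= 0).
    Returns False as soon as a sampled interior point violates the test.
    """
    for x_c in x_centers:
        y_c = f(x_c)
        for a in range(-steps, steps + 1):
            dx = w * a / steps
            reach = h - m * abs(dx)          # vertical extent at this dx
            for b in range(-steps, steps + 1):
                dy = reach * b / steps
                x_int, y_int = x_c + dx, y_c + dy
                f_top = f(x_int) + h
                f_bot = f(x_int) - h
                if (y_int - f_top) * (y_int - f_bot) > 0:
                    return False
    return True

centers = [k * 0.5 for k in range(-10, 11)]

# Slope of 0.5*sin(x) stays within (-1, 1), so no corner switch occurs.
gentle = check_middle_segment(lambda x: 0.5 * math.sin(x), 1.0, 2.0, 1.0, centers)

# Slope of 2x exceeds the bound m = 1, so the claim is expected to fail.
steep = check_middle_segment(lambda x: 2.0 * x, 1.0, 2.0, 1.0, centers)

print(gentle, steep)
```

The second call shows why the slope bound matters: once the trajectory is steeper than the sides adjacent to the active corners, interior points can fall outside the band between the two corner-trajectories.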


V. IMPLEMENTATION 


We have implemented our automated method in Python 
using SymPy, a symbolic math library [30]. The code 
implementing our algorithm and the applications in Sec- 
tion VI is available on GitHub at https://github.com/nskh/ 
automatic-safety-proofs. 

Given a fully symbolic trajectory and object, we first
identify the angles corresponding to sides of the object. Then,
following Section III-B, we identify points on the trajectory
corresponding to the angle θ_i of each side of the object.
To avoid discontinuities in the arctan function, the
implementation solves a reformulation:
sin(θ_i) = (dy/dx) cos(θ_i).
Solving this equation may yield either y in terms of x or the
reverse. In this case, we substitute the implicit solution for
x or y into the trajectory equation, eliminate the remaining
variable, and identify transition points. Given transition points,
we can implement the test from (7). SymPy includes a
“point-in-polygon” method, which we use to identify if an obstacle
(x_O, y_O) lies in the “notch” at any transition point. The output
explicit formulation can be expressed either in LaTeX or in
Mathematica format; output in Mathematica also supports
generating code to copy-paste directly and plot the safe
region using Mathematica’s RegionPlot[] functionality.
Examples can be found in Figures 9 and 10.

Fig. 9: Safe region for an instance of [22]. The notches are the
red-hatched rectangles and the trajectory is dashed in purple.
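
To make the reformulation concrete, the sketch below solves sin(θ_i) = (dy/dx) cos(θ_i) in closed form for a quadratic trajectory y = ax² + bx + c, where dy/dx = 2ax + b. It is a numeric stand-in, not the symbolic SymPy solve used by the implementation, and the function name `transition_x` is hypothetical.

```python
import math

def transition_x(a, b, theta_deg):
    """x-coordinate where y = a x^2 + b x + c has slope angle theta.

    Solves sin(theta) = (2 a x + b) cos(theta). Returns None when there is
    no isolated solution: a vertical side (cos(theta) = 0) is never matched
    by a function y = f(x), and a straight line (a = 0) has constant slope.
    """
    theta = math.radians(theta_deg)
    s, c = math.sin(theta), math.cos(theta)
    if abs(c) < 1e-12 or a == 0:
        return None
    return (s / c - b) / (2 * a)

# For an axis-aligned rectangle the side angles are 0 and 90 degrees.
print(transition_x(1.0, 0.0, 0))    # transition at the parabola's vertex
print(transition_x(1.0, 0.0, 90))   # None: no point with a vertical tangent
print(transition_x(1.0, 0.0, 45))   # where the slope equals 1
```

Writing the condition with sin and cos means the vertical-side case shows up as an equation with no solution rather than as an undefined tangent, which is the motivation given above for avoiding arctan.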

In order to implement our method in a fully symbolic 
fashion, we must account for the potential values of symbols 
when instantiated. We can leverage SymPy’s built-in “as- 
sumptions” to specify that certain symbols representing, say, 
trajectory parameters or object dimensions are real, positive, 
and/or nonzero, but these assumptions may not suffice to 
construct a fully symbolic safe region. In that case, our fully 
symbolic implementation computes a number of potential valid 
safe regions. As detailed in Section III-D, we construct the
explicit formulation using many clauses defined on intervals 
between transition points and/or piecewise boundaries. In the 
symbolic case, the order of these terms may differ, depending 
on, say, the sign of a variable in the trajectory. Additionally, 
symbolic piecewise cases for, say, x < b may mean that certain 
transition points do not occur at all if b lies in some range. Cor- 
respondingly, our fully symbolic implementation computes all 
valid orderings of piecewise boundaries and transition points; 
it additionally considers all valid combinations of transition 
points to account for “notches” that may not exist when 
piecewise bounds and/or trajectory parameters are instantiated. 
In order to check if orderings are valid, we attempt to sort 
using the SymPy assumptions: if we know b is positive, no 
returned ordering will place b before a transition point at 0, 
for example. Additionally, we enforce that adjacent points in 
the ordering “come from” the same functions: we will not 
return an ordering where a transition point from piecewise
subfunction f_1 lies between the piecewise boundaries for f_2.
Doing so ensures that we often generate relatively few (≈ 10)
potential orderings despite considering many combinations, 
though examples with intractably many orderings do exist. 
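
The ordering step can be illustrated with a brute-force sketch. This is an assumption-level simplification: it represents what is known about the symbols as a set of "a comes before b" facts instead of SymPy assumptions, it does not implement the rule that adjacent points must come from the same subfunction, and the name `valid_orderings` is hypothetical.

```python
from itertools import permutations

def valid_orderings(points, known_less_than):
    """Enumerate orderings of symbolic points consistent with known facts.

    points: names of piecewise boundaries and transition points.
    known_less_than: set of (a, b) pairs meaning a < b is known to hold.
    """
    result = []
    for perm in permutations(points):
        position = {name: k for k, name in enumerate(perm)}
        if all(position[a] < position[b] for a, b in known_less_than):
            result.append(perm)
    return result

# Hypothetical case: a transition point at 0, a boundary b known to be
# positive, and a second transition point t with unknown sign.
orderings = valid_orderings(["0", "b", "t"], {("0", "b")})
for ordering in orderings:
    print(ordering)
```

Enumerating permutations grows factorially with the number of points, which is consistent with the remark above that some examples produce intractably many orderings.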


VI. APPLICATIONS AND EVALUATION 


A. Verification of vertical maneuvers in ACAS X 


A collision avoidance system intended to prevent near mid-air
collisions, ACAS X, was verified in [22]. The KeYmaera X
proof presented in [22] required a significant amount of human
interaction (on the order of hundreds of hours), while the
method presented in this paper generates an explicit
formulation from the trajectory fully automatically. ACAS X
prevents collisions between aircraft by issuing advisories
(control commands) to one aircraft, the ownship. The bounds of
aircraft in this work are shaped like hockey pucks (cylinders
wider than they are tall) of radius r_p and half-height h_p. From
a side perspective of an encounter between aircraft, the bounds
are rectangular. In [22], verification was performed in a
side-view perspective, assuming two aircraft approach each other
in a vertically-oriented planar slice of three dimensions. A
careful choice of reference frame can reduce a three-dimensional
encounter between aircraft into a two-dimensional system, by
modeling the encounter as a 1-dimensional vertical encounter
and the distance of a horizontal encounter [22, Section 6].
To simplify calculations, [22] used the relative horizontal
speed r_v of the two aircraft and assumed it constant; the
vertical velocity ḣ of the oncoming aircraft is also assumed
constant. Advisories consist of climb and descent speed
advisories, yielding ownship trajectories that are piecewise
combinations of parabolas and straight lines. One example
trajectory is below in (12), which assumes the advisory issued
is for the ownship to climb at a rate ḣ_f greater than its current
vertical velocity ḣ_0. (r_t, h_t) are the (x, y) coordinates for
trajectory T in this example, and a_r is the acceleration.


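\[
(r_t, h_t) =
\begin{cases}
\left(r_v t,\ \dfrac{a_r}{2} t^2 + \dot{h}_0 t\right) & \text{for } 0 \le t < \dfrac{\dot{h}_f - \dot{h}_0}{a_r} \\[2ex]
\left(r_v t,\ \dot{h}_f t - \dfrac{(\dot{h}_f - \dot{h}_0)^2}{2 a_r}\right) & \text{for } \dfrac{\dot{h}_f - \dot{h}_0}{a_r} \le t
\end{cases}
\tag{12}
\]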

The implicit formulation of the safe region is below, for an
oncoming aircraft at relative coordinates (r_O, h_O).

\[
\forall t.\ \forall r_t.\ \forall h_t.\ \big((r_t, h_t) \in \mathcal{T} \implies |r_O - r_t| > r_p \,\lor\, |h_O - h_t| > h_p\big)
\tag{13}
\]

In [22], the authors eliminate the parametrization over t,
which yields an initial parabolic section and then
straight-line motion after. We use this t-free trajectory to
compute the unsafe region, which is displayed in Figure 9. A
boolean formulation of the unsafe region is available in our
full paper on arXiv.
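
As a rough illustration of what the implicit formulation quantifies over, the sketch below evaluates a t-free version of the climb trajectory (substituting t = r / r_v) and checks the body of the quantified formula at finitely many sampled points. Sampling can only find violations; it is not a substitute for the explicit formulation or for a proof. The function names and the parameter values are hypothetical, and the sketch assumes a_r > 0, r_v > 0, and a climb advisory with ḣ_f > ḣ_0.

```python
def ownship_height(r, r_v, h0_dot, hf_dot, a_r):
    """Height of the ownship at horizontal distance r (t eliminated)."""
    t = r / r_v
    t_switch = (hf_dot - h0_dot) / a_r
    if t < t_switch:
        # parabolic section while accelerating toward the advised rate
        return 0.5 * a_r * t * t + h0_dot * t
    # straight-line section once the advised climb rate is reached
    return hf_dot * t - (hf_dot - h0_dot) ** 2 / (2 * a_r)

def sampled_safe(r_o, h_o, r_p, h_p, r_max, n, **params):
    """Check the body of the implicit formula at n + 1 sampled points."""
    for k in range(n + 1):
        r_t = r_max * k / n
        h_t = ownship_height(r_t, **params)
        if not (abs(r_o - r_t) > r_p or abs(h_o - h_t) > h_p):
            return False
    return True

# Hypothetical parameter values chosen only to exercise both sections.
PARAMS = dict(r_v=1.0, h0_dot=0.0, hf_dot=2.0, a_r=1.0)

print(ownship_height(2.0, **PARAMS))                        # switch point
print(sampled_safe(4.0, 6.0, 1.0, 0.5, 10.0, 100, **PARAMS))  # on the path
print(sampled_safe(4.0, 0.0, 1.0, 0.5, 10.0, 100, **PARAMS))  # far below it
```

Both branches give the same height at the switch point, reflecting that this piecewise trajectory is continuous.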


B. Verified Turning Maneuvers for Unmanned Aerial Vehicles 


Turning maneuvers for unmanned aerial vehicles (UAVs) 
have been verified as safe in [1], where the UAV was rep- 
resented as a circular safety buffer around a point object 
fixed along the trajectory. The KeYmaera X proof presented 
in [1] required a significant amount of human interaction 
(on the order of hundreds of hours); in contrast, the method 
presented in this paper generates an explicit formulation from 
the trajectory fully automatically. This work represents motion 
in a two-dimensional plane viewed top-down, with the buffer
“puck” taking the form of a circle. The turning maneuver 
trajectory moves along a circular arc then in a straight line: 


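\[
\mathrm{traj}(x_T, y_T) \equiv
\begin{cases}
x_T^2 + y_T^2 = R^2 & \text{if } y_T \le x_T \tan\theta \\[1ex]
y_T = \dfrac{R\cos\theta - x_T}{\tan\theta} + R\sin\theta & \text{if } y_T > x_T \tan\theta
\end{cases}
\tag{14}
\]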

Fig. 10: Approximated safe region for an instance of [1]. The
notches are the red-hatched hexagons, and the trajectory is
dashed in purple.

With a circular safety buffer of radius r_p, the implicit
formulation ensures that, for all points along the trajectory,
the obstacle (x_O, y_O) is at least r_p away.

\[
\forall x_T.\ \forall y_T.\ \big(\mathrm{traj}(x_T, y_T) \implies (x_O - x_T)^2 + (y_O - y_T)^2 > r_p^2\big)
\tag{15}
\]
Note that our method does not support circular objects, only 
polygons, so we overapproximate the circular safety buffer 
as a regular hexagon circumscribing the circle. This approximation
allows a valid overapproximation of the unsafe region, since 
the hexagon contains the original circle in [1]. Note that the 
approximation of a circle can be made arbitrarily precise by 
increasing the number of sides of a polygon used. A plot of 
the unsafe region is in Figure 10 and a boolean formulation 
of the unsafe region is in our full paper on arXiv. 
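
The polygonal overapproximation of a circular buffer can be sketched as follows. This is an illustration under assumptions, with hypothetical function names: it builds a regular n-sided polygon whose inscribed circle has the buffer radius, so the polygon contains the circle, and it reports how much extra area the polygon adds.

```python
import math

def circumscribed_polygon(radius, n_sides):
    """Corners of a regular n-gon that contains a circle of the given radius.

    The polygon's inscribed circle has exactly that radius, so its corners
    sit at distance radius / cos(pi / n) from the center.
    """
    circumradius = radius / math.cos(math.pi / n_sides)
    return [(circumradius * math.cos(2 * math.pi * k / n_sides),
             circumradius * math.sin(2 * math.pi * k / n_sides))
            for k in range(n_sides)]

def area_overestimate(radius, n_sides):
    """Relative extra area of the polygon compared with the circle."""
    polygon_area = n_sides * radius ** 2 * math.tan(math.pi / n_sides)
    return polygon_area / (math.pi * radius ** 2) - 1.0

hexagon = circumscribed_polygon(1.0, 6)
print(len(hexagon))
print(round(area_overestimate(1.0, 6), 3))    # roughly a tenth extra area
print(round(area_overestimate(1.0, 24), 3))   # shrinks as sides are added
```

The trade-off is that more sides give a tighter region but also more corners and transition points for the method to handle.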


C. Runtime Evaluation 


This section presents a comparison of our method to quanti- 
fier elimination via cylindrical algebraic decomposition (CAD) 
[12]. We consider a variety of cases, from fully numeric to 
fully symbolic. Fully symbolic cases use trajectories without 
real constants, like ax² + bx + c = d, and polygons with
variable dimensions like rectangles of width w and height h.
Fully numeric cases instantiate all parameters with reals to
yield, say, trajectory 4x² + 2x + 1 and a rectangle of width 2
and height 1. In Table I, our “Numeric Trajectory” examples 
instantiate only the trajectory with reals but leave the polygon 
symbolic, and the “Numeric Hexagon/Rectangle” examples 
leave the trajectory symbolic but use reals for the polygon 
dimensions. 

Results were generated using a 2017 iMac Pro workstation 
with 128 GB of RAM, with CAD results using Mathematica’s 
Resolve implementation. A table of results is shown in 
Table I. We use the examples from VI-A (ACAS X) and VI-B 
(UAV). The Dubins path example is inspired by common path 
planners and takes the form of two circular arcs connected by 
a straight line and ending with a line; its symbolic trajectory 
equation is included in our full paper on arXiv. 


Our findings in Table I demonstrate the advantages and 
disadvantages of our method relative to quantifier elimination 
using CAD. For non-polynomial examples like a rectangle 
moving along the Dubins path described above or the UAV 
example from [1], the active corner method is able to compute 
fully symbolic formulations of the safe region when CAD fails 
to return an answer when run overnight (8+ hours). We do 
note that due to the complexity of a symbolic hexagon moving 
along the Dubins path, the number of transition points means 
our method cannot compute an answer, though neither can 
CAD. For a fully numeric example from [1], CAD took 2381 
seconds to run but returned False incorrectly in place of a 
region. Additionally, memory is often a constraint for symbolic 
computation given the CAD algorithm’s doubly-exponential 
runtime [14]; many examples consumed 100+ GB of RAM and 
one case grew to consume 350 GB of RAM without returning 
an answer. In the worst case, however, our method consumes 
under 100 MB of RAM. On the other hand, for strictly
polynomial examples like that in [22], CAD runs quickly and
efficiently, though our method remains competitive. 


VII. RELATED WORK 


Reachability computation is a vital question in safety- 
critical cases where users seek to guarantee properties or 
behavior. One method of constructing reachable sets for safety 
is zonotope reachability [3]. Reachability computation using 
zonotopes offers efficient algorithmic methods and supports 
analysis of dynamical systems with uncertainty. Zonotopes 
have been used in verification of automated vehicles [2], the 
design of safe trajectories for quadrotor aircraft [24], and the 
analysis of power systems [15], among other applications. 
Zonotope reachability methods discretize a dynamical system 
and iteratively propagate an estimate of the reachable set 
forward in time. Their input is a differential equation, while 
our method requires an explicit closed-form trajectory. For the 
purpose of checking safety, the estimate of the reachable set 
must be either exact or an overestimate; in order to deal with 
discretization error, zonotope methods repeatedly overestimate 
the reachable interval. Zonotope methods for nonlinear sys- 
tems rely on linearization and again account for error that may 
occur by expanding the reachable set [4]. Our method yields 
exact reachable sets. While it is possible to model convex 
object reachability with zonotopes, the reachable set expands 
with the time horizon because the dimensions of the object are 
treated not as constant dimensions but as uncertainty in initial 
conditions that is propagated forward through time [20]. 

Interval-based reachability methods share similarities to 
zonotope methods but do not aim for the tightest approx- 
imations possible; instead simpler axis-aligned sets (hyper- 
rectangles in high dimensions) are used for computation [32], 
[33]. These representations simplify storage in memory and 
operations like intersections but do not compute exact esti- 
mates in the way our method does. However, they do support 
both continuous [31], [25] and discrete [13] dynamic systems 
with uncertainty. Set-valued constraint solving may be used 
but similarly relies on inexact discretization [21]. Other
reachability methods for differential equations include
Hamilton-Jacobi reachability for systems with complex,
nonlinear, high-dimensional dynamics [7], and control barrier
functions, which enable the construction of safe
optimization-based controllers [5].

Example  Instance                   Active Corners Time  Active Corners RAM  CAD Time   CAD RAM
UAV      Fully Numeric              0.48 sec             7.1 MB              2381* sec  30.89 MB
UAV      Numeric Trajectory         0.82 sec             8.4 MB              DNF        50+ GB
UAV      Numeric Hexagon            38 sec               22 MB               DNF        100+ GB
UAV      Fully Symbolic             45 sec               24 MB               DNF        100+ GB
Dubins   Fully Numeric              1.2 sec              9.0 MB              DNF        11+ GB
Dubins   Fully Symbolic: Rectangle  4505 sec             91 MB               DNF        4+ GB
Dubins   Fully Symbolic: Hexagon    DNF                  N/A                 DNF        8+ GB
ACAS X   Fully Numeric              0.13 sec             5.9 MB              0.04 sec   160 KB
ACAS X   Numeric Trajectory         0.48 sec             6.6 MB              0.04 sec   188 KB
ACAS X   Numeric Rectangle          0.51 sec             6.6 MB              0.2 sec    325 KB
ACAS X   Fully Symbolic             0.57 sec             6.6 MB              1.1 sec    1.8 MB

TABLE I: Evaluation results, with better results bolded. DNF: example did not finish in 8+ hours. *: incorrect answer.

A counterpart to reachability is automatic invariant genera- 
tion for hybrid systems, in which a formal statement showing 
a system never evolves into an unsafe state is proved. In 
[17], the authors proved a polynomial and its Lie derivatives 
can represent algebraic sets of polynomial vector fields. A 
procedure to check invariance of polynomial equalities was 
proposed in [18]. Semi-algebraic invariants for polynomial 
ODEs were studied in [19], [40], [28]. Invariants for hybrid 
systems were studied in [37] and [29]. Relational abstractions 
bridge the gap between continuous and discrete modes by 
over-approximating continuous system evolution to summarize 
the system as a purely discrete one using invariant generation 
[38]. Barrier certificates have also been used as invariants for 
safety verification in hybrid systems [36]. 

Our work has similar aims to swept-volume collision check- 
ing, from path planning and graphics, in which approximate, 
efficient collision-checking is performed as a volume is moved 
along a path. A convex over-approximation swept-volume 
approach was presented in [16]. Swept-volume checking in 
four dimensions was performed using an intersection test in 
space-time in [10]. An efficient algorithm computing distances 
between convex polytopes, the Lin-Canny algorithm, was pro- 
posed for this task in [26], [27]. Methods are typically discrete 
and approximate for performance in online applications. That 
said, there are some exact methods such as collision checking 
for straight-line segments like those on robotic arms [39] and 
an algorithm for large-scale environments [11]. However, these 
methods operate on individual collision checking instances, 
such as graphics simulations or video game environments, and 
their results cannot be used repeatedly. Our method yields 
provably correct, fully symbolic, and exact safe regions for 
continuous trajectories and supports, for example, efficient
quantifier-free testing at runtime or in large-scale settings
once a desired safe region formulation has been generated. 

Another alternative to this work is quantifier elimination, 
a general algorithm for converting formulas with quantified 


variables into equivalent statements that are quantifier-free 
[41], [12]. Quantifier elimination can be performed using 
Cylindrical Algebraic Decomposition (CAD), an algorithm 
that operates on semialgebraic sets [12], [6]. QEPCAD is one 
notable software tool implementing CAD that could be used 
in this work [8]. The runtime of the CAD algorithm is doubly 
exponential in the number of total variables (not the number of 
quantified variables) [14], [9]; we offer a detailed comparison 
to CAD in Mathematica in Section VI-C. 


VIII. CONCLUSION AND FUTURE WORK 


We have presented an automated approach to construct 
explicit safe regions for convex polygons moving in the plane 
with piecewise equations of motion of the form y = f(x) and 
x = f(y). We have also proved the equivalence of the implicit 
and explicit formulations of the safe region, discussed an au- 
tomated implementation of our method, and benchmarked the 
performance of our method compared to quantifier elimination 
using cylindrical algebraic decomposition. 

We would like to study how our method extends to objects 
translating in 3 or n dimensions; we conjecture that active 
corners will become active edges in three dimensions (and 
(n − 2)-polyhedra in n dimensions). Additionally, we would
like to expand to handle trajectories in the form of inequalities;
rotating objects; and invariants of differential equations of the
form f(x, y) = 0 rather than explicit trajectories. On the
implementation side, we are currently exploring how to au- 
tomatically output a machine-checkable proof of equivalence 
between the implicit and explicit formulations, using the PVS 
theorem prover. 
Abstract: Pushdown automata are an essential model of
recursive computation. In model checking and static analysis, nu- 
merous problems can be reduced to reachability questions about 
pushdown automata and several efficient libraries implement 
automata-theoretic algorithms for answering these questions. 
These libraries are often used as core components in other tools, 
and therefore it is instrumental that the used algorithms and 
their implementations are correct. We present a method that 
significantly increases the trust in the answers provided by the 
libraries for pushdown reachability by (i) formally verifying the 
correctness of the used algorithms using the Isabelle/HOL proof 
assistant, (ii) extracting executable programs from the formaliza- 
tion, (iii) implementing a framework for the differential testing of 
library implementations with the verified extracted algorithms as 
oracles, and (iv) automatically minimizing counter-examples from 
the differential testing based on the delta-debugging methodology. 
We instantiate our method to the concrete case of PDAAAL, 
a state-of-the-art library for pushdown reachability. Thereby, 
we discover and resolve several nontrivial errors in PDAAAL. 


I. INTRODUCTION 


In 1964, Büchi [7] proved that the possibly infinite set of all 
reachable pushdown configurations (from a given initial con- 
figuration) can be effectively described by a regular language. 
In fact, even for a given regular set of pushdown configura- 
tions, its post* and pre* closures (representing all forward 
and backward reachable configurations from a given set of 
configurations) are also regular. Büchi's automata-theoretic
approach gave rise to a rich theory of pushdown reachability 
with numerous algorithms and applications to, e.g., interpro- 
cedural control-flow analysis of recursive programs [9], [11], 
model checking [4], [13], [45], [46], communication network 
analysis [10], [21], [22] and others. A number of tools have 
been developed to support the theory, including Moped [45], 
[46], WALi [25], and PDAAAL [23] with applications ranging 
from the static analysis of Java [46] and C/C++ code [26], 
[43] to the analysis of MPLS communication protocols [22]. 

Even though the automata-theoretic approach for pushdown 
reachability is based on relatively simple saturation proce- 
dures, the proofs of correctness are nontrivial and the imple- 
mentation of the algorithms in the different tools often includes 
numerous performance optimizations as well as additional 
improvements to the theory itself [23]. To be able to rely on 
the output of model checking tools and other applications of
pushdown reachability, it is important that the theory is not
only sound but also correctly implemented. A positive
reachability answer is typically accompanied by finite evidence
(a trace) that can function as an efficiently checkable certificate.
A negative answer is, on the other hand, much harder to
check, and designing finite evidence for non-reachability is
difficult, primarily because the number of reachable pushdown
configurations can be infinite. One approach is to establish 
an invariant that (i) includes the initial configuration(s) of 
the system, (ii) is maintained by the transition relation and 
(iii) has an empty intersection with the set of undesirable 
configurations. Such approaches have been studied [16], [17] 
but are usually incomplete and require another complex tool 
(that can be error-prone, too) to verify such invariants. 


We instead use a proof assistant, Isabelle/HOL [37] (§II), to
formally verify the correctness of the pushdown reachability
algorithms post* (forward search), pre* (backward search),
and dual* (bi-directional search) (§III) that lie at the heart of
the automata-theoretic analysis of pushdown systems [4], [23], 
[44]. From the formalization of pre*, we extract an executable 
program with strong correctness guarantees (§IV). For a given 
input, the extracted program’s output can be compared with 
the output of other, unverified but optimized tools solving the 
same problem (§V). This approach is known as differential 
testing [14], [18], [34] with a twist that the testing oracle 
has been formally verified and thus is extremely trustworthy. 
When testing reveals a disagreement between a verified 
and an unverified algorithm, we know who is to blame. To 
help localize errors in unverified algorithms, we minimize 
the tests causing disagreement using the delta-debugging 
technique [51]. Our main contributions are as follows. 


- The formalization of post*, pre* and dual* algorithms in
  Isabelle/HOL and verification of their correctness based
  on the proofs provided by Schwoon [44] for post* and
  pre*, and following Jensen et al. [23] for dual*.

- The refinement to and the extraction of an executable
  program of the formalized pre* algorithm that serves as
  the verified oracle for differential testing.

- The automatic minimization of the input automata in
  cases where an unverified tool disagrees with the oracle.

- The application of our method to a modern state-of-the-art
  library for pushdown reachability, PDAAAL [23],
  and the identification, localization (using the minimized
  counter-examples), and correction of three previously
  unknown implementation errors (§VI). The corrected
  implementation passes all differential tests successfully.

Our Isabelle formalization as well as the case study are 
publicly available [40]. 

a) Related work: Differential testing with a verified 
oracle has been used in the context of runtime verification and 
automatic theorem proving. The runtime monitor VeriMon [2], 
[41] served as the verified oracle used to detect errors in unver- 
ified monitors. Compared to our approach, VeriMon’s differen- 
tial testing case study is from a different application domain, 
does not include exhaustive test generation for small input 
sizes (which is difficult in runtime monitoring) and does not 
minimize the tests automatically. To assess its performance but 
also to evaluate the benchmark’s correctness, the verified first- 
order prover RPx [39] was evaluated on a standard benchmark 
for first-order logic problems. RPx’s answers have in all cases 
coincided with the expected ones recorded in the benchmark. 

The verified C compiler CompCert [32] and several verified 
distributed systems [20], [33], [48] have been themselves put 
onto the testbed [15], [50]. A few errors in these tools’ unveri- 
fied parts or in scenarios violating the verification assumptions 
were found, but none in the verified components themselves. 

Many works extract efficient executable code from formal- 
izations, but do not use it as an oracle in testing. Examples 
include verified model checkers for LTL [12] and timed 
automata [49] and verified algorithms for finite automata [3], 
[6], [24], [31] and context-free grammars [35], [38]. 

The only formalization of pushdown automata we are 
aware of is part of Lammich et al.’s work on dynamic 
pushdown networks (DPN) [30]. Lammich describes the 
Isabelle formalization of an executable pre* algorithm for 
DPNs stemming from this work in an unpublished technical 
report [29]. DPNs generalize pushdown automata, but their 
post* is not regular [5] and so we cannot extend this work for 
our purposes. Moreover, Lammich’s formalization does not 
support ε-transitions in the underlying automata, an essential
component needed for our formalization of post* and dual*.

b) Background definitions: Let P be a finite set of control
locations and Γ a finite stack alphabet. A pushdown system
(PDS) is a tuple (P, Γ, Δ), where Δ ⊆ (P × Γ) × (P × Γ*)
is a finite set of rules, written (p, γ) ↪ (q, w) whenever
((p, γ), (q, w)) ∈ Δ. Without loss of generality, we assume
|w| ≤ 2, so that w = ε represents a pop operation that removes
the topmost stack symbol, |w| = 1 is a swap that replaces the
topmost symbol with another one, and |w| = 2 is a push that
incorporates a swap followed by adding a new symbol on top.

A configuration of a pushdown system is a pair (p, w) of the
current control location p ∈ P and the current stack content
w ∈ Γ* where we assume that the top of the stack is on the left.
The set of all configurations is denoted by 𝒞. A PDS can take a
computation step (p, γw') ⇒ (q, ww') between configurations
whenever (p, γ) ↪ (q, w) and w' ∈ Γ*. For a given C ⊆ 𝒞,
we define post*(C) = {c' ∈ 𝒞 | c ⇒* c' for some c ∈ C}
and pre*(C) = {c ∈ 𝒞 | c ⇒* c' for some c' ∈ C}.

The reachability problem for PDSs is to decide whether
c ⇒* c' for configurations c and c', and it is equivalent
to asking whether c' ∈ post*({c}) or equivalently whether
c ∈ pre*({c'}). Büchi [7] showed that for any regular set
C ⊆ 𝒞, the sets post*(C) and pre*(C) are also regular.
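
To make the step relation concrete, the following Python sketch (an illustration of ours, not part of the formalization) encodes rules as a dictionary and checks reachability by a depth-bounded search. The names RULES, successors and reachable_within, the string encoding of control locations and stack symbols, and the depth bound are all assumptions made for this example. A bounded search is of course no decision procedure: the set of reachable configurations may be infinite, which is exactly why the automata-theoretic approach is needed.

```python
from collections import deque

# Hypothetical encoding: a rule (p, gamma) -> (q, w) is stored as
# RULES[(p, gamma)] = [(q, w), ...], where w is a tuple with len(w) <= 2
# and the top of the stack is the leftmost element.
RULES = {
    ("p1", "g1"): [("p2", ("g2", "g0"))],  # push: |w| = 2
    ("p2", "g2"): [("p0", ())],            # pop:  |w| = 0
}

def successors(rules, config):
    """All configurations reachable from `config` in one computation step."""
    p, stack = config
    if not stack:
        return []  # no rule applies to an empty stack
    gamma, rest = stack[0], stack[1:]
    return [(q, w + rest) for (q, w) in rules.get((p, gamma), [])]

def reachable_within(rules, source, target, max_steps):
    """Breadth-first check of source =>* target using at most max_steps steps."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        config, depth = frontier.popleft()
        if config == target:
            return True
        if depth == max_steps:
            continue
        for nxt in successors(rules, config):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False
```

With these two rules, the configuration ("p1", ("g1", "g0")) reaches ("p0", ("g0", "g0")) in two steps: a push followed by a pop.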

To represent regular sets of pushdown configurations, we
use P-automata [44], which are nondeterministic finite
automata with multiple initial states, one for each of the control
locations from the set P. Formally, let N be a finite set of
noninitial states and F ⊆ P ∪ N a finite set of final states. A
P-automaton is a tuple A = (P ∪ N, →, P, F) with the transition
relation → ⊆ (P ∪ N) × Γ × (P ∪ N) so that P ∪ N is the set
of its states and the pushdown alphabet Γ is the input
alphabet of the automaton. The language L(A) of P-automaton
A contains the pushdown configurations accepted by A: a
configuration (p, w) ∈ P × Γ* is accepted if and only if there is
a path from p to q for some q ∈ F in the P-automaton (defined
via the transition relation →) labelled with w. The reachability
problem for P-automata is as follows: given a PDS (P, Γ, Δ)
and P-automata A₁ and A₂, does there exist c ∈ L(A₁) and
c' ∈ L(A₂) such that c ⇒* c' using the rules Δ?
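
Acceptance by a P-automaton can be sketched in a few lines of Python. The encoding below is an assumption of ours for illustration: a state is a tagged pair such as ("Init", "p0") or ("Noninit", "q1"), a transition is a triple (source, symbol, target), and the function names reach and accepts are hypothetical.

```python
def reach(transitions, state, word):
    """States reachable from `state` by reading `word`, a tuple of stack symbols."""
    current = {state}
    for symbol in word:
        current = {dst for (src, label, dst) in transitions
                   if src in current and label == symbol}
    return current

def accepts(transitions, finals, config):
    """A configuration (p, w) is accepted iff some final state is reachable
    from the initial state of p along a path labelled with w."""
    p, word = config
    return bool(reach(transitions, ("Init", p), word) & finals)

# A two-transition automaton whose language is the single configuration
# (p0, [g0, g0]).
P0, Q1, Q2 = ("Init", "p0"), ("Noninit", "q1"), ("Noninit", "q2")
AUTOMATON = {(P0, "g0", Q1), (Q1, "g0", Q2)}
FINALS = {Q2}
```

Here accepts(AUTOMATON, FINALS, ("p0", ("g0", "g0"))) holds, while the shorter stack ("g0",) is rejected because the path ends in the non-final state Q1.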


II. ISABELLE/HOL 


Isabelle/HOL [37] is a proof assistant based on classical 
higher-order logic (HOL), a simply typed lambda calculus with 
Hilbert choice, axiom of infinity, and rank-1 polymorphism. 
We present our formalization using HOL’s syntax, which 
mixes functional programming and mathematical notation. 

Types are built from type variables 'a, 'b, ... and type
constructors like pairs _ × _ and functions _ ⇒ _ (both written
infix) and sets _ set (written postfix). Type constructors can
also be nullary, e.g., the Boolean type bool. Type variables
can be restricted by type classes: 'a :: finite is a type
variable 'a that can only be instantiated with finite types (i.e.,
types with finitely many inhabitants). New type constructors
are introduced as abbreviations for complex type expressions
and as inductive datatypes using commands type_synonym
and datatype respectively, e.g., the types of transitions
type_synonym ('state, 'label) transition = 'state × 'label ×
'state and finite lists datatype 'a list = [] | 'a # ('a list).

Terms are built from variables x, y, ..., constants c, d, ...,
lambda abstractions λx. t and applications written as
juxtaposition f x. Isabelle includes many constants and syntax for
them, e.g., infix operators ∧, ∨, ⟶, ⟷, ∈, ∪, ∩, unbounded
and bounded quantifiers ∃x. P x and ∀y ∈ A. Q y, and set
comprehensions {x. P x}. Non-recursive functions are
defined and given readable syntax using the definition command:

  definition image (infix `) where
    f ` A = {y. ∃x ∈ A. y = f x}

Type annotations like image :: ('a ⇒ 'b) ⇒ 'a set ⇒ 'b set
can be omitted as they are inferred. Recursive definitions are
supported using the fun command:

  fun append (infix @) where
    [] @ ys = ys | (x # xs) @ ys = x # (xs @ ys)
locale LTS = fixes trans_rel :: ('state, 'label) transition set
begin
definition step_relp (infix ⇒) where
  c ⇒ c' ⟷ (∃l. (c, l, c') ∈ trans_rel)
definition step_starp (infix ⇒*) where
  c ⇒* c' ⟷ step_relp** c c'
definition pre_star C = {c'. ∃c ∈ C. c' ⇒* c}
definition post_star C = {c'. ∃c ∈ C. c ⇒* c'}

definition srcs = {p. ∄q γ. (q, γ, p) ∈ trans_rel}
definition sinks = {p. ∄q γ. (p, γ, q) ∈ trans_rel}
inductive_set trans_star where
  (p, [], p) ∈ trans_star
| (p, γ, q') ∈ trans_rel ⟹ (q', w, q) ∈ trans_star ⟹
    (p, γ#w, q) ∈ trans_star
end


Fig. 1: The locale for labeled transition systems 


Internally, fun performs an automatic termination proof. More 
complex recursion schemes may require a manual proof. 
Another way to define a function is as Prolog-style mono- 
tone rules. The inductive command allows such definitions as 
least fixed points. Take, e.g., the reflexive transitive closure: 


  inductive rtranclp (_**) where
    R** x x | R** x y ⟹ R y z ⟹ R** x z
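
The least-fixed-point reading of these two rules can be illustrated on a finite relation. The Python sketch below is ours and only mirrors the rule shapes; the function name rtrancl and the explicit domain argument are assumptions (Isabelle's rtranclp is defined over arbitrary, possibly infinite, types and is not computed this way).

```python
def rtrancl(relation, domain):
    """Least set of pairs closed under the two rules of rtranclp,
    computed for a finite relation over a finite domain."""
    closure = {(x, x) for x in domain}          # rule 1: R** x x
    changed = True
    while changed:
        changed = False
        for (x, y) in list(closure):
            for (y2, z) in relation:
                # rule 2: R** x y and R y z imply R** x z
                if y2 == y and (x, z) not in closure:
                    closure.add((x, z))
                    changed = True
    return closure

EXAMPLE = rtrancl({(1, 2), (2, 3)}, {1, 2, 3})
```

The loop stops as soon as neither rule produces a new pair, which is the fixed point.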


Theorems and lemmas are terms of type bool that have 
been proved to be equivalent to True. All proofs pass through 
Isabelle’s kernel, which relies only on a few well-understood 
reasoning rules such as modus ponens. We refer to a text- 
book [36] for a practical introduction to proving in Isabelle. 

Structures and assumptions common to many theorems can
be organized via locales [1], Isabelle's module mechanism
for fixing parameters and stating and assuming their
properties. In the context of a locale, the parameters are
available as constants and the assumptions as facts. Locales
can be interpreted, which involves instantiating the parameters
and proving the assumptions. As a result, one obtains the
(instantiated) theorems proved in the context of the locale.


Consider our locale for labeled transition systems (LTSs) in 
Fig. 1. It fixes the parameter trans_rel, and its context consists 
of the definitions between the begin and end keywords. 
All definitions should be self-explanatory except perhaps 
trans_star: the set of triples (p, w, q) for which the LTS can
move from p to q by consuming word w. This relation is
defined inductively, first for the empty sequence and then by
extending it with one more symbol. The extension step takes the
conjunction of two assumptions, one on the symbol γ and one on
the sequence w. (Following an Isabelle convention, we formalize
the conjunction equivalently as two implications.) In the
formalization, the locale has more definitions than shown here
and a number of lemmas. Outside LTS's context, we can access
its definitions, e.g., pre_star is available under the name
LTS.pre_star and can be applied to any transition
relation A and a set of states C as follows: LTS.pre_star A C.


datatype 'label op = pop | swap 'label | push 'label 'label
type_synonym ('ctr_loc, 'label) rule =
  ('ctr_loc × 'label) × ('ctr_loc × 'label op)
type_synonym ('ctr_loc, 'label) conf = 'ctr_loc × 'label list

locale PDS = fixes Δ :: ('ctr_loc, 'label :: finite) rule set
begin
fun lbl where
  lbl pop = [] | lbl (swap γ) = [γ] | lbl (push γ γ') = [γ, γ']
definition is_rule (infix ↪) where
  (p, γ) ↪ (p', w) ⟷ ((p, γ), (p', w)) ∈ Δ
inductive_set step where
  (p, γ) ↪ (p', w) ⟹
    ((p, γ # w'), (), (p', lbl w @ w')) ∈ step
interpretation LTS step .
end

datatype ('ctr_loc, 'noninit) state =
  Init 'ctr_loc | Noninit 'noninit

locale PDS_with_finals = PDS Δ
  for Δ :: ('ctr_loc :: enum, 'label :: finite) rule set +
  fixes F_inits :: 'ctr_loc set and F_noninits :: 'noninit set
begin
definition finals = Init ` F_inits ∪ Noninit ` F_noninits
definition inits = {q. ∃p. q = Init p}
definition accepts A (p, w) =
  (∃q ∈ finals. (Init p, w, q) ∈ LTS.trans_star A)
definition lang A = {c. accepts A c}
end

Fig. 2: The types and locales for pushdown systems


III. PUSHDOWN REACHABILITY 


We formalize pushdown systems (PDSs) and saturation 
algorithms for calculating pre* and post* following
Schwoon [44] and dual* following Jensen et al. [23].

Fig. 2 shows our modeling of PDSs. We use type
variables to represent control locations ('ctr_loc) and
stack labels ('label). We introduce types for operations
('label op), rules (('ctr_loc, 'label) rule) and configurations
(('ctr_loc, 'label) conf). A PDS is given by the locale PDS,
which fixes a set of rules Δ. Each PDS gives rise to an
unlabeled transition relation, which we model by an LTS step with
label (), the only element of type unit. The definition is a
non-recursive inductive definition. We use the interpretation
command to interpret LTS with step. This means that pre_star
refers to LTS.pre_star step in PDS. Likewise, trans_star refers
to LTS.trans_star step and similarly for other LTS definitions.
The type ('ctr_loc, 'noninit) state represents P-automata
states, where 'noninit is the type variable for noninitial states.
The locale PDS_with_finals extends PDS with a set of final
initial states F_inits and final noninitial states F_noninits. For
the rest of this section, we work within the PDS_with_finals
locale. In this locale, a P-automaton is a set of transitions.
[Figure: P-automaton with solid and dashed γ-labeled transitions]

Pᵢ = Init pᵢ for i ∈ {0, 1, 2}
Qᵢ = Noninit qᵢ for i ∈ {1, 2}

definition Δ = {((p₂, γ₂), (p₀, pop)),
                ((p₁, γ₁), (p₂, push γ₂ γ₀))}
definition A = {(P₀, γ₀, Q₁), (Q₁, γ₀, Q₂)}


Fig. 3: Adding two transitions (dashed arrows) to a
P-automaton. Initially (solid arrows) the P-automaton encodes
only configuration (p₀, [γ₀, γ₀]). After saturation, the
configurations (p₁, [γ₁, γ₀]) and (p₂, [γ₂, γ₀, γ₀]) are also encoded.


A. Nondeterministic pre* Saturation 


Schwoon [44] presents the pre* saturation which is a 
nondeterministic algorithm that given a P-automaton A 
returns a P-automaton whose language is pre_star (lang A). 
The algorithm proceeds by iteratively adding transitions to 
A. In each step, the algorithm nondeterministically chooses 
a transition to add that satisfies a number of criteria. The 
P-automaton is saturated when no more transitions can be 
added. We formalize a step of the algorithm by the relation: 

  inductive pre_star_rule where
    (Init p, γ, q) ∉ A ⟹ (p, γ) ↪ (p', w) ⟹
    (Init p', lbl w, q) ∈ LTS.trans_star A ⟹
    pre_star_rule A (A ∪ {(Init p, γ, q)})

The pre_star_rule relation relates two P-automata if the 
latter can be obtained from the former via one step of the 
algorithm. The criteria of the algorithm are expressed as the 
premises of the implication shown in pre_star_rule’s definition. 
The last two premises are taken directly from Schwoon’s 
definition of the algorithm and the first one ensures that the 
transition we add into the new P-automaton is a new one. A 
single P-automaton can be related to different P-automata via 
pre_star_rule, which captures nondeterministic choice. 

Consider the PDS defined by Δ in Fig. 3, and let the
P-automaton A consist of the two solid transitions in the
figure. Let A' be A ∪ {(P₂, γ₂, P₀)}. Notice that (P₂, γ₂, P₀) ∉
A and (p₂, γ₂) ↪ (p₀, pop) and (P₀, lbl pop, P₀) ∈
LTS.trans_star A. From pre_star_rule's definition then follows
that pre_star_rule A A'. Let A'' be A' ∪ {(P₁, γ₁, Q₁)}. From
pre_star_rule's definition it follows that pre_star_rule A' A''.

We formalize what it means for a P-automaton A to be 
saturated w.r.t. a rule r, and for A' to be a saturation of A:

  definition saturated r A = (∄A'. r A A')

  definition saturation r A A' = (r** A A' ∧ saturated r A')

In our example, A'' is saturated and thus formally we have
saturated pre_star_rule A'' and saturation pre_star_rule A A''.
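
The example can be replayed with a small Python sketch. It is an illustration of ours, not the formalization: rules are encoded as pairs ((p, γ), (p', w)) with w already given as the tuple that lbl would return, states are tagged pairs, and the names pre_star_candidates and saturated are hypothetical. The function pre_star_candidates returns every transition that a single application of pre_star_rule could add, so the nondeterministic choice corresponds to picking any element of that set.

```python
def reach(automaton, state, word):
    """States reachable from `state` by reading `word` (cf. trans_star)."""
    current = {state}
    for symbol in word:
        current = {dst for (src, label, dst) in automaton
                   if src in current and label == symbol}
    return current

def pre_star_candidates(rules, automaton):
    """Transitions (Init p, gamma, q) satisfying all three premises of the rule."""
    new = set()
    for (p, gamma), (p2, w) in rules:
        for q in reach(automaton, ("Init", p2), w):
            t = (("Init", p), gamma, q)
            if t not in automaton:  # first premise: the transition is new
                new.add(t)
    return new

def saturated(rules, automaton):
    """No rule application is possible."""
    return not pre_star_candidates(rules, automaton)

# The PDS and P-automaton of Fig. 3, with strings standing in for symbols.
P0, P1, P2 = ("Init", "p0"), ("Init", "p1"), ("Init", "p2")
Q1, Q2 = ("Noninit", "q1"), ("Noninit", "q2")
RULES = {(("p2", "g2"), ("p0", ())),            # pop
         (("p1", "g1"), ("p2", ("g2", "g0")))}  # push g2 g0
A0 = {(P0, "g0", Q1), (Q1, "g0", Q2)}
A1 = A0 | pre_star_candidates(RULES, A0)  # adds (P2, g2, P0)
A2 = A1 | pre_star_candidates(RULES, A1)  # adds (P1, g1, Q1)
```

In this run only one candidate exists at each stage, so the two steps coincide with A' and A'' from the text, and saturated(RULES, A2) holds.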

We next prove the pre* saturation algorithm correct. Here, 
we focus on the proof’s most interesting aspects, especially 
those where we had to deviate from Schwoon’s pen-and-paper 
proof, and refer to our formalization for full details [40]. 

The correctness theorem states that if a transition system 
A’ is a saturation of a transition system A then the language 


of A’ is indeed the pre* closure of the language of A. Like 
Schwoon, we assume that the initial states are sources: 
  theorem pre_star_rules_correct:
    assumes inits ⊆ LTS.srcs A
    and saturation pre_star_rule A A'
    shows lang A' = pre_star (lang A)

Schwoon's Lemma 3.1 is used to prove the ⊇ direction of the
theorem's conclusion. He proves it by considering an arbitrary
predecessor configuration (p', w) of a configuration (p, v) in
A's language. The proof proceeds by induction on the number
of ⇒ transitions from (p', w) to (p, v). We do not keep track
of this number, but we instead prove the lemma by induction
on the transitive and reflexive closure of ⇒. The formalization
of the proof is written in Isabelle's structured proof language
Isar (not shown) and follows Schwoon's arguments.
Schwoon's Lemma 3.2 is used to prove the ⊆ direction of
pre_star_rules_correct's conclusion. We showcase Lemma 3.2
in Schwoon's formulation, but adapted to our notation:

Lemma 3.2 If saturation pre_star_rule A A' and
(p, w, q) ∈ LTS.trans_star A' then:

(a) (p, w) ⇒* (p', w') for a configuration (p', w') such that
    (p', w', q) ∈ A;
(b) moreover, if q is an initial state, then w' = [].
In his proof, Schwoon claims to prove (a) by an induction and 
then that (b) will follow immediately from a simple argument. 
However, reading his proof we notice that he uses (b) in the 
proof of (a). We resolve this by noticing that we can strengthen 
(b) to hold for any stack w and not just the one w’ claimed 
to exist in (a). Our formulation of (b) looks as follows: 
  lemma word_into_init_empty:
    assumes (p, w, Init q) ∈ LTS.trans_star A
    and inits ⊆ LTS.srcs A
    shows w = [] ∧ p = Init q
We prove (a) using the strengthened version of (b). Like 
Schwoon, we prove (a) by a nested induction. His outer 
induction is on the number of times the algorithm added 
transitions to the P-automaton. We instead prove the lemma by 
induction on the transitive reflexive closure of pre_star_rule. 
The inner induction is more challenging to formalize. Here, 
Schwoon considers a specific transition t which he defines 
as the ith transition added to P-automaton A. In the same 
context he considers a word w and two states, Init p and 
q, such that (Init p,w,q) € LTS.trans_star A’. He then 
defines j as the number of times t is used in (Init p, w,q) € 
LTS.trans_star A’. We may argue that this number is not well- 
defined, because there can be several paths from Init p to q 
consuming w, and on these paths t may not occur the same 
number of times. It turns out we can choose among these paths 
completely freely—any one of them will work, and so we just 
choose one arbitrarily. Formalizing this required us to define a 
variant of trans_star that keeps track of the intermediate states. 


B. Nondeterministic post* Saturation 


We call states with no incoming or outgoing transitions 
isolated. The post* saturation algorithm requires the addition 
of new noninitial states that are isolated in the automaton
on which the algorithm is run. Under certain conditions the
algorithm adds transitions into and out of these. Each such new
state corresponds to a control location and a label. We extend
the datatype of states with a new constructor Isolated for these:

  datatype ('ctr_loc, 'noninit, 'label) state =
    Init 'ctr_loc | Noninit 'noninit | Isolated 'ctr_loc 'label

Moreover, we define isols = {q. ∃p γ. q = Isolated p γ}.
Steps in the post* saturation are formalized as follows:

  inductive post_star_rules where
    (p, γ) ↪ (p', pop) ⟹ (Init p', ε, q) ∉ A ⟹
    (Init p, [γ], q) ∈ LTS_ε.trans_star_ε A ⟹
    post_star_rules A (A ∪ {(Init p', ε, q)})
  | (p, γ) ↪ (p', swap γ') ⟹ (Init p', Some γ', q) ∉ A ⟹
    (Init p, [γ], q) ∈ LTS_ε.trans_star_ε A ⟹
    post_star_rules A (A ∪ {(Init p', Some γ', q)})
  | (p, γ) ↪ (p', push γ' γ'') ⟹
    (Init p', Some γ', Isolated p' γ') ∉ A ⟹
    (Init p, [γ], q) ∈ LTS_ε.trans_star_ε A ⟹
    post_star_rules A (A ∪ {(Init p', Some γ', Isolated p' γ')})
  | (p, γ) ↪ (p', push γ' γ'') ⟹
    (Isolated p' γ', Some γ'', q) ∉ A ⟹
    (Init p', Some γ', Isolated p' γ') ∈ A ⟹
    (Init p, [γ], q) ∈ LTS_ε.trans_star_ε A ⟹
    post_star_rules A (A ∪ {(Isolated p' γ', Some γ'', q)})

The relation has one rule for pop, one for swap, and two
for push. It uses LTS_ε.trans_star_ε, which is similar to
LTS.trans_star but allows ε-transitions that do not consume
stack symbols. The transition (Init p', ε, q) is an ε-transition
and (Init p', Some γ', q) is a γ'-labeled non-ε-transition. The
function lang_ε returns the language of a P-automaton with
ε-transitions. We prove post* saturation correct:


  theorem post_star_rules_correct:
    assumes saturation post_star_rules A A'
    and inits ⊆ LTS.srcs A and isols ⊆ LTS.isolated A
    shows lang_ε A' = post_star (lang_ε A)

Schwoon's definition of the post* rule has only one rule
for push (in contrast to our two rules). In his rule, Schwoon
first adds a transition (Init p', Some γ', Isolated p' γ') and
then adds a transition (Isolated p' γ', Some γ'', q). Consider
his rule here presented in his formulation but our notation:

  If (p, γ) ↪ (p', push γ' γ'') and
  (Init p, [γ], q) ∈ LTS_ε.trans_star_ε A,
  first add (Init p', Some γ', Isolated p' γ');
  then add (Isolated p' γ', Some γ'', q).

We were at first surprised that he specified this first/then order,
but his correctness proof actually relies on it. Specifically, the
order is used in his proof of Lemma 3.4, which is the key to
proving the ⊇ direction of post_star_rules_correct. We present
Lemma 3.4 in Schwoon's formulation but our notation:

Lemma 3.4 If saturation post_star_rules A A' and
(Init p, w, q) ∈ LTS_ε.trans_star_ε A' then:

(a) if q ∉ isols, then (p', w') ⇒* (p, w) for a configuration
    (p', w') such that (Init p', w', q) ∈ LTS_ε.trans_star_ε A;
(b) if q = Isolated p' γ', then (p', γ') ⇒* (p, w).

Schwoon's proof is a nested induction. The outer induction
is on the number of transitions post* has added. The induction
step proceeds by an inner induction on the number of times the
most recently added transition t was used in (Init p, w, q') ∈
LTS_ε.trans_star_ε A'. (We resolve the ambiguity of that
number's meaning in a similar way as for pre*.) The proof
then proceeds by a case distinction on which of the post*
saturation rules added t. Consider the case where t was added by
the “first” part of the rule for push. In this case, t has the form
(Init p', Some γ', Isolated p' γ'). Schwoon states that “Then
since Isolated p' γ' has no transitions leading into it initially, it
cannot have played part in an application rule before this step,
and t is the first transition leading to it. Also, there are no
transitions leading away from t so far.” Had Schwoon not forced
the algorithm to first add the transition into Isolated p' γ' and
then add the one out of it, then he could not have claimed
that there are no transitions leading away from t. We capture
this idea in the following two lemmas, stating that if t is not
present, then Isolated p' γ' must be a source and a sink:


  lemma post_star_rules_Isolated_source_invariant:
    assumes post_star_rules** A A'
    and isols ⊆ LTS.isolated A
    and (Init p', Some γ', Isolated p' γ') ∉ A'
    shows Isolated p' γ' ∈ LTS.srcs A'

  lemma post_star_rules_Isolated_sink_invariant:
    assumes post_star_rules** A A'
    and isols ⊆ LTS.isolated A
    and (Init p', Some γ', Isolated p' γ') ∉ A'
    shows Isolated p' γ' ∈ LTS.sinks A'


Formalizing Schwoon's push rule as a single rule in
post_star_rules does not capture the order in which the two
transitions are added to the set. This is why we split the rule
in two: one rule adds the transition into the new noninitial
state and another adds the transition out of the new
noninitial state. This does not yet impose the needed first/then
order. However, we can impose the order by letting the
latter rule be only applicable if the transition added by the
former is indeed already in the automaton. This is possible
because the transition added into state Isolated p' γ' is
(Init p', Some γ', Isolated p' γ'), and thus we can refer to
the states comprising this transition in any context where
Isolated p' γ' is available, in particular, the second push rule.
Note that our post* saturation algorithm is slightly more
general than Schwoon's as we do not require the transition out of
the new noninitial state to be added immediately after the
transition into it; rather, we allow this to happen at any later time.
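
The four rules, including the guard that imposes the first/then order on the two push rules, can be sketched in Python as follows. This is an illustration under our own assumptions rather than the formalized algorithm: ε is encoded as None, a non-ε label is the stack symbol itself (instead of Some γ), operations are tagged tuples, and the names reach_eps, post_star_candidates, post_star and accepts_eps are hypothetical. The loop adds all currently applicable transitions in each round, which is one particular way of resolving the nondeterminism.

```python
EPS = None  # label of an epsilon-transition; stack symbols must not be None

def eps_closure(automaton, states):
    closure, todo = set(states), list(states)
    while todo:
        s = todo.pop()
        for (src, label, dst) in automaton:
            if src == s and label is EPS and dst not in closure:
                closure.add(dst)
                todo.append(dst)
    return closure

def reach_eps(automaton, state, word):
    """States reachable by reading `word`, with epsilon-moves allowed anywhere."""
    current = eps_closure(automaton, {state})
    for symbol in word:
        moved = {dst for (src, label, dst) in automaton
                 if src in current and label == symbol}
        current = eps_closure(automaton, moved)
    return current

def post_star_candidates(rules, automaton):
    new = set()
    for (p, gamma), (p2, op) in rules:
        for q in reach_eps(automaton, ("Init", p), (gamma,)):
            if op[0] == "pop":
                new.add((("Init", p2), EPS, q))
            elif op[0] == "swap":
                new.add((("Init", p2), op[1], q))
            else:  # ("push", g1, g2)
                _, g1, g2 = op
                iso = ("Isolated", p2, g1)
                into = (("Init", p2), g1, iso)
                new.add(into)               # first push rule
                if into in automaton:       # guard of the second push rule
                    new.add((iso, g2, q))
    return new - automaton

def post_star(rules, automaton):
    automaton = set(automaton)
    while True:
        new = post_star_candidates(rules, automaton)
        if not new:
            return automaton
        automaton |= new

def accepts_eps(automaton, finals, config):
    p, word = config
    return bool(reach_eps(automaton, ("Init", p), word) & finals)

P1, Q1, Q2 = ("Init", "p1"), ("Noninit", "q1"), ("Noninit", "q2")
RULES = {(("p1", "g1"), ("p2", ("push", "g2", "g0"))),
         (("p2", "g2"), ("p0", ("pop",)))}
START = {(P1, "g1", Q1), (Q1, "g0", Q2)}   # encodes only (p1, [g1, g0])
RESULT = post_star(RULES, START)
```

On this input the transition into the state Isolated p2 g2 is added in the first round, and the transition out of it only in the second round, once the guard is satisfied.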


C. Combined dual* Saturation 


We now consider the recent bi-directional search approach,
called dual* [23]. With dual* we can check if the
configurations of one P-automaton A₂ are reachable from another
P-automaton A₁ by alternating between saturating A₂ towards
its pre* closure and A₁ towards its post* closure, while
simultaneously (on-the-fly) keeping track of their intersection
fun (in LTS) reach where
  reach p [] = {p}
| reach p (γ#w) = (⋃q' ∈ (⋃(p', γ', q') ∈ step.
    if p' = p ∧ γ' = γ then {q'} else {}). reach q' w)
definition (in PDS) pre_star1 A = (⋃((p, γ), (p', w)) ∈ Δ.
  ⋃q ∈ LTS.reach A (Init p') (lbl w). {(Init p, γ, q)})

definition (in PDS) pre_star_exec = the ∘ while_option
  (λs. s ∪ pre_star1 s ≠ s) (λs. s ∪ pre_star1 s)


Fig. 4: Executable pre*


automaton. As soon as the intersection automaton becomes 
nonempty, we know that there is a state in A₂ that is reachable 
from A₁. This is the case even if the pre* and post* automata 
are not saturated. Our correctness theorem is formalized here: 


theorem dual_star_correct_early_termination: 
  assumes inits ⊆ LTS.srcs A₁ and inits ⊆ LTS.srcs A₂ 
  and isols ⊆ LTS.isolated A₁ ∩ LTS.isolated A₂ 
  and post_star_rules** A₁ A₁′ and pre_star_rule** A₂ A₂′ 
  and lang_ε_inters (inters_ε A₁′ (LTS_ε_of A₂′)) ≠ {} 
  shows ∃c₁ ∈ lang_ε A₁. ∃c₂ ∈ lang A₂. c₁ ⇒* c₂ 


The function LTS_ε_of trivially converts a P-automaton 
to a P-automaton with ε-transitions. The function inters_ε 
calculates the intersection P-automaton with ε-transitions 
of two P-automata with ε-transitions using a product 
construction. The function lang_ε_inters gives the language 
of an intersection automaton. Since the ⊆ directions of 
pre_star_rule_correct and post_star_rules_correct do not rely 
on A′ being saturated, we prove them assuming only 
pre_star_rule** A₂ A₂′ and post_star_rules** A₁ A₁′, 
respectively, instead of saturation pre_star_rule A₂ A₂′ and 
saturation post_star_rules A₁ A₁′. We use these more 
general lemmas to prove dual_star_correct_early_termination. 


IV. EXECUTABLE PUSHDOWN REACHABILITY 


To get an executable algorithm for pre*, we resolve the non- 
determinism by defining a functional program pre_star_exec, 
presented in Fig. 4 (where we indicate the corresponding 
locale for each definition), with this characteristic property: 


theorem pre_star_exec_language_correct: 
assumes inits ⊆ LTS.srcs A 
shows lang (pre_star_exec A) = pre_star (lang A) 


The function reach is trans_star’s executable counterpart: for 
a state p and a word w, reach p w computes the set of states 
reachable from p via w using step (fixed in the LTS locale). In 
other words, we have q ∈ reach p w iff (p, w, q) ∈ trans_star. 
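
The recursion behind reach can be mirrored in a few lines of Python. This is only a sketch under assumptions: the transition relation is encoded as a plain set of (source, label, target) triples, and the names reach and step are illustrative stand-ins rather than the code extracted from Isabelle.

```python
def reach(step, p, word):
    """Return the set of states reachable from p by reading word.

    step is a set of (source, label, target) triples (an assumed encoding).
    """
    if not word:                      # reach p [] = {p}
        return {p}
    gamma, rest = word[0], word[1:]
    result = set()
    for (src, label, dst) in step:    # follow every transition matching (p, gamma)
        if src == p and label == gamma:
            result |= reach(step, dst, rest)
    return result


step = {("s0", "a", "s1"), ("s0", "a", "s2"), ("s1", "b", "s3")}
print(reach(step, "s0", ["a"]))       # the two states s1 and s2
print(reach(step, "s0", ["a", "b"]))  # only s3
```

An empty result set corresponds to the case where no triple (p, w, q) is in trans_star for any q.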

The definition of pre_star_exec uses while_option, the func- 
tional while loop counterpart. Given a test predicate b, a loop 
body c and a loop state s, the expression while_option b c s 
computes the optional state Some (c (⋯(c (c s)))) not 
satisfying b with the minimal number of applications of c, or 
None if no such state exists. Our specific loop keeps adding the 
results of a single step pre_star1 to the P-automaton compris- 
ing the loop state. We prove that our loop never returns None, 


definition nonempty A P Q = 
  (∃p ∈ P. ∃q ∈ Q. ∃w. (p, w, q) ∈ trans_star A) 

definition inters A B = 
  {((p₁, p₂), w, (q₁, q₂)). (p₁, w, q₁) ∈ A ∧ (p₂, w, q₂) ∈ B} 

definition nonempty_inter Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ = 
  nonempty (inters A₁ (pre_star_exec Δ A₂)) 
    ((λx. (x, x)) ` inits) (finals F₁ F₁ⁿⁱ × finals F₂ F₂ⁿⁱ) 

definition check Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ = 
  (if ¬inits ⊆ LTS.srcs A₂ then None 
   else Some (nonempty_inter Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ)) 


Fig. 5: Reachability check for P-automata 


i.e., it always terminates. We thus use the, defined partially as 
the (Some x) = x, in pre_star_exec to extract the resulting 
P-automaton. The step pre_star1 computes the set of all transi- 
tions that can be added by a single application of pre_star_rule. 

Fig. 4’s definitions are executable: Isabelle can inter- 
pret them as functional programs and extract Standard ML, 
Haskell, OCaml, or Scala code [19], but it is usually not possi- 
ble to extract code for inductive predicates (such as trans_star 
or the transitive closure in saturation) or definitions involving 
quantifiers ranging over an infinite domain (as in saturated). 
The definition of pre_star_exec has an obvious inefficiency. In 
every iteration, pre_star1 is evaluated twice: once as a part of 
the loop body and once as a part of the test. Instead we use 
the following improved equation, which replaces while_option 
with explicit recursion, for code extraction. 


lemma pre_star_exec_code[code]: 
  pre_star_exec s = (let s' = pre_star1 s in 
    if s' ⊆ s then s else pre_star_exec (s ∪ s')) 
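
The shape of this fixed-point computation can be sketched in Python. The encoding is an assumption made for illustration: a pushdown rule is a pair ((p, gamma), (p2, w)) with w a list of labels, automaton states are named directly after control locations (the Init/Noninit wrappers of the formalization are omitted), and the function names merely echo those of Fig. 4.

```python
def reach(trans, p, word):
    """States reachable from p by reading word over the triples in trans."""
    states = {p}
    for sym in word:
        states = {dst for (src, lab, dst) in trans
                  if src in states and lab == sym}
    return states


def pre_star1(rules, trans):
    """All transitions that one application of the pre* rule could add."""
    return {(p, gamma, q)
            for ((p, gamma), (p2, w)) in rules
            for q in reach(trans, p2, w)}


def pre_star_exec(rules, trans):
    """Add the result of pre_star1 until nothing new appears."""
    s = set(trans)
    while True:
        new = pre_star1(rules, s)     # evaluated once per iteration
        if new <= s:                  # corresponds to the test s' ⊆ s
            return s
        s |= new


# A single pop rule (p0, D) -> (p0, pop); the target automaton has no transitions.
rules = [(("p0", "D"), ("p0", []))]
print(pre_star_exec(rules, set()))    # → {('p0', 'D', 'p0')}
```

As in the improved equation, the single-step function is computed once per round and reused for both the test and the update.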


With the executable algorithm for pre*, we decide the 
reachability problem for P-automata using the check function 
shown in Fig. 5. It inputs a PDS Δ along with two P-automata 
represented by their transition relations (A₁ and A₂), their 
final initial states (F₁ and F₂) and their final noninitial states 
(F₁ⁿⁱ and F₂ⁿⁱ). The computation proceeds by intersecting 
(inters) the initial P-automaton with the pre* saturation of 
the final P-automaton and checking the result’s nonemptiness 
(nonempty). Fig. 5 refers to functions pre_star_exec, inits, 
finals, and trans_star which we introduced earlier in the 
context of different locales, outside of the respective locale. 
Therefore, these functions take additional parameters that 
correspond to the fixed parameters of the respective locale if 
they are used by the function (e.g., we write pre_star_exec Δ 
instead of pre_star_exec for an implicitly fixed Δ). 

The definition of nonempty is not executable because of the 
quantification over words w. We implement, but omit here, the 
straightforward executable algorithm that starts with the set of 
initial states P and iteratively adds transitions from A until it 
reaches Q or saturates without reaching Q, in which case the 
language is empty since no state in Q is reachable from P. 
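
One possible shape of such a nonemptiness check, applied directly to the product of two automata, is sketched below in Python. It is an illustration under assumptions, not the omitted formalized algorithm: transitions are (source, label, target) triples, both automata share the initial states inits, and the search explores reachable product states breadth-first instead of materializing the intersection.

```python
from collections import deque


def nonempty_inter(trans1, trans2, inits, finals1, finals2):
    """Is some word accepted by both automata from a common initial state?"""
    start = [(p, p) for p in inits]
    seen, queue = set(start), deque(start)
    while queue:
        s1, s2 = queue.popleft()
        if s1 in finals1 and s2 in finals2:
            return True                      # reached a pair of final states
        for (a1, lab1, b1) in trans1:
            if a1 != s1:
                continue
            for (a2, lab2, b2) in trans2:    # both automata must read the same label
                if a2 == s2 and lab2 == lab1 and (b1, b2) not in seen:
                    seen.add((b1, b2))
                    queue.append((b1, b2))
    return False                             # saturated without reaching finals


# First automaton accepts <p0, D D>; second one loops on D in its final state p0.
t1 = {("p0", "D", "q0"), ("q0", "D", "q1")}
t2 = {("p0", "D", "p0")}
print(nonempty_inter(t1, t2, {"p0"}, {"q1"}, {"p0"}))  # → True
```

The search terminates because the set of product states is finite and each pair is enqueued at most once.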

Overall, check returns an optional Boolean value, where 
None signifies a well-formedness violation on the final 
P-automaton: a non-source initial state in A₂. If check returns 
Some b, then b is the answer to the reachability problem for 
P-automata. We formalize this characterization of check by 
the following two theorems (phrased outside of locales). 


theorem check_None: 
  check Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ = None ⟷ 
  ¬inits ⊆ LTS.srcs A₂ 


theorem check_Some: 
  check Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ = Some b ⟷ 
  (inits ⊆ LTS.srcs A₂ ∧ (b ⟷ 
    (∃p w p' w'. step_starp Δ (p, w) (p', w') ∧ 
      (p, w) ∈ lang A₁ F₁ F₁ⁿⁱ ∧ (p', w') ∈ lang A₂ F₂ F₂ⁿⁱ))) 


V. DIFFERENTIAL TESTING 


Differential testing [14], [18], [34] is a technique for finding 
implementation errors by executing different algorithms solv- 
ing the same problem on a set of test cases and comparing the 
outputs. Differential testing has been effective for finding er- 
rors in a wide range of domains, from network certificate vali- 
dation [47] to JVM implementations [8]. Yet, even different al- 
gorithms do not necessarily fail independently, e.g., when built 
from the same specification [27] or when sharing potentially 
faulty components, e.g., input parsers or preprocessing. To reduce 
the danger of missing such errors, we suggest incorporating 
a formally verified implementation into differential testing. 
Moreover, in case of a discrepancy the verified oracle reliably 
tells us which of the unverified implementations is wrong. 
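
The role of the verified oracle can be illustrated with a small Python harness. Everything here is a hypothetical stand-in: tools maps names to callables, oracle is the trusted implementation, and the toy problem (deciding evenness) only serves to show the comparison logic.

```python
def differential_test(tools, oracle, cases):
    """Compare every tool against the trusted oracle on every test case."""
    failures = []
    for case in cases:
        expected = oracle(case)
        wrong = sorted(name for name, tool in tools.items()
                       if tool(case) != expected)
        if wrong:                     # the oracle tells us exactly who is wrong
            failures.append((case, expected, wrong))
    return failures


# Toy stand-ins: the "problem" is deciding whether a number is even.
oracle = lambda n: n % 2 == 0
tools = {
    "good": lambda n: n % 2 == 0,
    "buggy": lambda n: n % 2 == 0 or n == 7,   # wrong on exactly one input
}
print(differential_test(tools, oracle, range(10)))  # → [(7, False, ['buggy'])]
```

Without the oracle, a disagreement between the two tools on input 7 would only show that one of them is wrong, not which one.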


A. Differential Testing of Pushdown Reachability 


Our executable formalization of pushdown reachability 
allows us to perform differential testing on unverified tools 
for the same problem. A test case for pushdown reachability 
consists of a PDS with rules Δ and two P-automata A₁ and 
A₂ representing the initial and final configurations of interest. 
The answer to the test case is whether there exist c ∈ L(A₁) 
and c′ ∈ L(A₂) such that c ⇒* c′ using the rules Δ. 

To execute the formalization on a given test case, we 
generate an Isabelle theory file, which first defines the control 
locations, labels, and automata states as finite subsets of the 
natural numbers (their sizes depending on the specific test 
case), and then includes definitions for the pushdown rules Δ and the two 
P-automata, each represented by its transitions Aᵢ along with 
the accepting (initial and noninitial) states Fᵢ and Fᵢⁿⁱ for i ∈ 
{1,2}. Fig. 3 shows a specific example of such definitions. 

We generate a lemma that uses our check function, where 
the expected result Some True or Some False is inserted 
depending on the answer produced by an unverified tool under 
test (invoked before generating the theory on the same inputs): 


lemma check Δ A₁ F₁ F₁ⁿⁱ A₂ F₂ F₂ⁿⁱ = Some True by eval 


The eval proof method extracts Standard ML code for check 
and other constants in the lemma and executes the lemma 
statement as an expression. It succeeds iff the lemma evaluates 
to True. We call a test case a counter-example if the proof 
method fails. One could also run the extracted code outside 


Input: Reachability tools tool and oracle, PDS (P, Γ, Δ), 
       P-automata Aᵢ = (P ∪ Nᵢ, →ᵢ, P, Fᵢ) for i ∈ {1,2}. 
Output: Minimal counter-example (failing test case) 
 1: c ← Δ ∪ ({1} × (→₁ ∪ F₁)) ∪ ({2} × (→₂ ∪ F₂))   ▷ Convert to a set of features 
 2: return DD(c, 2)   ▷ returned set of features can be 
                        converted to PDS and P-automata as on lines 10-11 
 3: function DD(c, n)   ▷ c is a test case, n is granularity 
 4:   let c₁ ⊎ ⋯ ⊎ cₙ = c, all cᵢ as evenly sized as possible 
 5:   if ∃i. BAD(cᵢ) return DD(cᵢ, 2) 
 6:   else if ∃i. BAD(c \ cᵢ) return DD(c \ cᵢ, max(n−1, 2)) 
 7:   else if n < |c| return DD(c, min(2n, |c|)) 
 8:   else return c 
 9: function BAD(c)   ▷ c is a test case 
10:   let Δ′ = c ∩ Δ   ▷ extract PDS rules and P-automata 
11:   for i ∈ {1,2} let Aᵢ′ = (P ∪ Nᵢ, →ᵢ′, P, Fᵢ′) where 
        →ᵢ′ = {t ∈ →ᵢ | (i,t) ∈ c} and Fᵢ′ = {q ∈ Fᵢ | (i,q) ∈ c} 
12:   with both tools check if A₁′ reaches A₂′ via (P, Γ, Δ′) 
13:   return false if tool and oracle agree, else true 


Algorithm 1: Specialization of delta-debugging [51] to PDS. 


Isabelle, but our setup allows us to generate the inputs to check 
on the formalization level instead of that of the extracted code. 

To efficiently check a large number of test cases, we batch 
multiple definitions and lemmas into one theory file, thus 
reducing the overhead of starting Isabelle. We run Isabelle 
from the command line and check the output log for any failing 
eval proofs, which correspond to failing test cases. 


B. Automatic Counter-Example Minimization 


If differential testing finds a failing test case, we use delta- 
debugging [51] to automatically reduce it to a minimal failing 
test case to help the subsequent debugging process. We use 
the minimizing delta debugging algorithm [51] that sees a test 
case as a set of features, and works by systematically testing 
different subsets until a minimal failing test case is found. 

We use delta debugging on any discovered counter-example 
and fix the set of features to contain: (i) each pushdown rule, 
(ii) each transition in either of the P-automata, and (iii) each 
final state in a P-automaton (as opposed to it not being final). 

States and labels are identified by unique names, and the 
initial P-automata states are exactly the states mentioned in 
any pushdown rule in the feature set. We specialize the general 
delta debugging algorithm to pushdown systems as shown in 
Algorithm 1. The algorithm first creates the set of features and 
calls the recursive function DD with this set of features and the 
granularity 2. The function then splits the set of features into a 
number of equally sized subsets (according to the granularity) 
and checks if any of these subsets or their complements 
still fail. If yes, then the function tries to recursively reduce 
the set of features further, otherwise it will increase the 
granularity and try again. The function BAD converts the set 
Original: 

Δ = { (p₀,B) ↪ (p₂,push D B), (p₁,B) ↪ (p₁,swap C), 
  (p₀,D) ↪ (p₀,swap A), (p₀,C) ↪ (p₂,swap D), (p₁,E) ↪ (p₁,swap C), 
  (p₀,E) ↪ (p₀,push B E), (p₀,E) ↪ (p₂,swap E), (p₁,A) ↪ (p₂,swap A), 
  (p₀,D) ↪ (p₀,push D D), (p₀,E) ↪ (p₃,push B E), (p₁,D) ↪ (p₂,swap D), 
  (p₀,C) ↪ (p₃,swap E), (p₁,C) ↪ (p₂,swap E), 
  (p₀,D) ↪ (p₀,pop), (p₁,B) ↪ (p₀,swap C), (p₁,C) ↪ (p₃,swap D), 
  (p₀,D) ↪ (p₁,swap A), (p₁,D) ↪ (p₀,swap C), (p₁,D) ↪ (p₃,pop), 
  (p₀,A) ↪ (p₁,push C A), (p₁,C) ↪ (p₀,swap B), (p₂,B) ↪ (p₀,push A B), 
  (p₀,E) ↪ (p₂,push A E), (p₁,C) ↪ (p₀,swap E), (p₂,A) ↪ (p₀,push C A), 
  (p₂,C) ↪ (p₀,push C C), (p₂,B) ↪ (p₂,swap E), (p₃,E) ↪ (p₂,swap C), 
  (p₂,D) ↪ (p₀,push B D), (p₂,E) ↪ (p₃,push A E), (p₃,B) ↪ (p₂,push D B), 
  (p₂,B) ↪ (p₃,push C B), (p₃,E) ↪ (p₃,swap A), 
  (p₂,C) ↪ (p₁,push C C), (p₃,D) ↪ (p₀,push B D), (p₃,A) ↪ (p₃,push C A), 
  (p₂,A) ↪ (p₁,push B A), (p₃,E) ↪ (p₃,swap D), 
  (p₂,A) ↪ (p₂,push A A), (p₃,C) ↪ (p₀,push E C), 
  (p₂,C) ↪ (p₂,swap A), (p₃,C) ↪ (p₀,swap E), 
  (p₃,C) ↪ (p₁,push A C), (p₃,B) ↪ (p₁,pop), 
  (p₂,E) ↪ (p₂,swap A), (p₂,A) ↪ (p₂,push B A), 
  (p₃,C) ↪ (p₃,pop) } 

A₁ = {(Init p₀, B, Noninit q₁), (Init p₀, D, Noninit q₀), 
      (Init p₂, B, Noninit q₀), (Init p₃, A, Noninit q₂), 
      (Noninit q₀, D, Noninit q₁), (Noninit q₂, C, Noninit q₀)} 
F₁ = {}   F₁ⁿⁱ = {q₁} 
A₂ = {(Init p₂, A, Noninit q₀), (Init p₂, B, Noninit q₀)} 
F₂ = {p₀}   F₂ⁿⁱ = {} 

Minimized: 

Δ = {(p₀, D) ↪ (p₀, pop)} 
A₁ = {(Init p₀, D, Noninit q₀), (Noninit q₀, D, Noninit q₁)} 
F₁ = {}   F₁ⁿⁱ = {q₁} 
A₂ = {} 
F₂ = {p₀}   F₂ⁿⁱ = {} 


Fig. 6: Original (top) and minimized (bottom) counter-example 


of features into a reduced pushdown system and two reduced 
P-automata and checks if the given tool implementation is 
still inconsistent with the oracle. We note that minimal failing 
counter-examples are only locally minimal and not necessarily 
unique. Yet, minimization is effective and necessary. Fig. 6 
shows a real bug example we discovered by random 
differential testing in the PDAAAL library for pushdown 
reachability [23] and its minimization by Algorithm 1. 
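
The recursion of Algorithm 1 can be sketched in Python as a loop. This is a minimal illustration under assumptions: features are unique hashable items kept in a list, bad is a caller-supplied predicate playing the role of BAD, and the guard that caps the granularity at the number of features is an addition of this sketch so that no chunk is ever empty.

```python
def split(items, n):
    """Split items into n consecutive chunks, as evenly sized as possible."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def dd(features, bad, n=2):
    """Shrink a failing feature list to a locally minimal failing one."""
    c = list(features)
    while len(c) >= 2:
        n = min(n, len(c))                       # guard: at most one chunk per feature
        chunks = split(c, n)
        failing = next((ch for ch in chunks if bad(ch)), None)
        if failing is not None:                  # some subset still fails
            c, n = failing, 2
            continue
        rests = ([x for x in c if x not in ch] for ch in chunks)
        failing = next((r for r in rests if bad(r)), None)
        if failing is not None:                  # some complement still fails
            c, n = failing, max(n - 1, 2)
            continue
        if n < len(c):                           # refine the granularity
            n = min(2 * n, len(c))
        else:
            return c
    return c


# Toy predicate: the "bug" needs features 3 and 7 to be present together.
print(dd(list(range(10)), lambda s: 3 in s and 7 in s))  # → [3, 7]
```

In the setting of this paper, bad would rebuild the PDS and the two P-automata from the remaining features and report whether the tool and the oracle still disagree.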


VI. CASE STUDY: ANALYSIS OF PDAAAL 


We apply differential testing with automatic counter- 
example minimization to PDAAAL [42], a recent C++ imple- 
mentation of pushdown reachability checking, which appears 
to be currently the most efficient library for pushdown reachability [23]. 
PDAAAL implements post*, pre* and dual* [23]. 

These three different algorithms can be used in classical 
differential testing without a verified oracle, but given the large 
amount of shared code this is bound to miss some errors. And 
without a verified oracle, manual effort is needed to determine 
which implementation is faulty in case of discrepancies. This 
motivates using our verified reachability check via pre*, and 
we compare the output of each unverified algorithm to the out- 
put of our trustworthy oracle on a large number of test cases. 


A. Methodology of Test Case Generation 


We structure our test case generation in three phases. 

In phase one, we use real-world tests generated from 
the domain of network verification, which PDAAAL was 
originally built for as a backend [22]. We generate pushdown 
reachability problems from realistic network verification 
use-cases on (up to) 100 random reachability queries on 
each of the 260 different networks derived from the Internet 
Topology Zoo [28] giving a total of 25512 test cases. 

In phase two, we randomly generate valid pushdown sys- 
tems and P-automata. We generate 15000 cases of varying 
sizes with 4 control locations, 5 labels, up to 200 pushdown 
rules, and up to 13 automata transitions. Our generator writes 
all ingredients (pushdown system and P-automata) to a JSON 


file, which is then translated to the Isabelle definitions and 
correctness lemmas that incorporate the unverified answers. 
Finally, in phase three, we exhaustively enumerate the set 
of all test cases up to a certain (small) size. For the pushdown 
systems |P| = |Γ| = 2 and |Δ| < 2, and for P-automata 
|N₁| = 2, |N₂| = 1 and |→ᵢ| < 2. We remove symmetric 
cases, where swapping state names or labels gives an identical 
case. In total, this yields close to 27 million combinations of 
pushdown systems and P-automata. For the exhaustive tests, 
we output both JSON files and Isabelle definitions directly 
from the test case generator. A bash script stitches together the 
Isabelle definitions into a single theory file with a batch of test 
cases to benefit from Isabelle’s parallel processing of proofs. 
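
One way to detect symmetric cases is to map every case to a canonical representative under renaming, as in the Python sketch below. The encoding is hypothetical: a rule is a tuple (p, gamma, q, w) with w a tuple of labels, and canonical simply tries all permutations of state and label names, which is only feasible for the very small sizes used in the exhaustive phase. How the actual test generator removes symmetries is not described here.

```python
from itertools import permutations


def canonical(rules, states, labels):
    """Smallest renaming of the rule set over all state/label permutations."""
    best = None
    for state_perm in permutations(states):
        smap = dict(zip(states, state_perm))
        for label_perm in permutations(labels):
            lmap = dict(zip(labels, label_perm))
            renamed = tuple(sorted(
                (smap[p], lmap[g], smap[q], tuple(lmap[x] for x in w))
                for (p, g, q, w) in rules))
            if best is None or renamed < best:
                best = renamed
    return best


states, labels = ["p0", "p1"], ["A", "B"]
case1 = [("p0", "A", "p1", ("B",))]
case2 = [("p1", "B", "p0", ("A",))]   # case1 with both name pairs swapped
print(canonical(case1, states, labels) == canonical(case2, states, labels))  # → True
```

Keeping one test case per canonical form then discards the symmetric duplicates.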


B. Results 


The real-world test cases showed no discrepancies between 
the verified oracle and PDAAAL. This indicates that PDAAAL 
has already been thoroughly tested on this type of problem 
instance. Isabelle ran out of memory in 30 of the 25 512 test 
cases. The average CPU time (on AMD EPYC 7642 proces- 
sors at 1.5 GHz) per test case was 35 seconds for Isabelle, 
while PDAAAL used less than 0.02 seconds on most cases. 

Phase two, however, resulted in 1334 discrepancies. By 
applying our counter-example minimization, we noted that all 
these cases had a common trait: the P-automaton A₂ accepted 
the empty word. This helped us find the first error, located 
in the implementation of the on-the-fly automata intersection 
when using post*. The post* algorithm can introduce 
ε-transitions, which were not handled correctly by the intersection 
implementation. In most cases, this does not matter, as 
for any ε-transition followed by a normal transition the post* 
algorithm adds a direct transition at some later point. However, 
in the case of an empty stack being accepted by A₂, this does 
not happen, which causes the unverified algorithm to return 
the wrong answer False. We resolved the error and re-ran the 
generated tests. After that only one discrepancy remained. 

This second error was found in the implementation of pre*. 
The minimized counter-example helped us find the source 
10: function ADDTRANSITION(qᵢ −a→ᵢ qᵢ′)   ▷ with i ∈ {1,2} 
11:   add qᵢ −a→ᵢ qᵢ′ to Aᵢ 
12:   for all q₃₋ᵢ, q′₃₋ᵢ ∈ Q₃₋ᵢ s.t. (q₁, q₂) ∈ R and q₃₋ᵢ −a→₃₋ᵢ q′₃₋ᵢ do 
13:     add (q₁, q₂) −a→ (q₁′, q₂′) to A∩ 
14:     ADDSTATE(q₁′, q₂′) 


(a) Snippet of (correct) intersection pseudocode by Jensen et al. [23] 


src/include/pdaaal/Solver.h 
@@ -119,10 +119,10 @@ namespace pdaaal { 
119 119     if (res.second) { // New edge is not already in edges (rel U workset). 
120 120         _workset.emplace(from, label, to); 
121 121         if (trace != nullptr) { // Don't add existing edges 
    122 +           _automaton.add_edge(from, to, label, trace_ptr_from<W>(trace)); 
122 123             if constexpr (ET) { 
123 124                 _found = _found || _early_termination(from, label, to, trace_ptr_from<W>(trace)); 
124 125             } 
125     -           _automaton.add_edge(from, to, label, trace_ptr_from<W>(trace)); 
126 126         } 
127 127     } 
128 128 } 


(b) PDAAAL’s C++ code showing the resolution of the second error 


Fig. 7: Discovered second implementation error and its correct pseudocode 


of the implementation error: the set of automata transitions 
was updated only after calling the function that performs 
the nonemptiness check of the intersection automaton, but it 
should have been updated before that call. We argue that this 
error is subtle, as it only causes a single failure out of 15 000 
randomly generated test cases. Fig. 7a shows the correct 
pseudocode by Jensen et al. [23]. Fig. 7b shows PDAAAL’s 
corresponding C++ code and the change resolving the error, 
where the line that needed to be moved corresponds to the 
pseudocode’s Line 11. 
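
The effect of this ordering can be reproduced on a toy model in Python. The model is a deliberate simplification and not PDAAAL's code: edges are plain (source, target) pairs, reaches stands in for the nonemptiness check, and the flag update_first selects between the corrected and the faulty order.

```python
def reaches(edges, start, goal):
    """Depth-first search: is goal reachable from start over edges?"""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        for (src, dst) in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return False


def saturate(new_edges, start, goal, update_first):
    """Add edges one by one, checking for early termination at each step."""
    edges, found = set(), False
    for edge in new_edges:
        if update_first:
            edges.add(edge)                      # corrected order: update, then check
            found = found or reaches(edges, start, goal)
        else:
            found = found or reaches(edges, start, goal)
            edges.add(edge)                      # faulty order: the check sees stale edges
    return found


new_edges = [("a", "b"), ("b", "c")]
print(saturate(new_edges, "a", "c", update_first=True))   # → True
print(saturate(new_edges, "a", "c", update_first=False))  # → False
```

In this model the faulty order only misses the answer when the decisive edge is the very last one added, which is consistent with such an error surfacing rarely.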

For both errors, the affected test cases resulted in a correct 
answer for at least one of the other search strategies in 
PDAAAL. This is not the case for the last error, which 
is found in code shared by all three methods, and where 
PDAAAL’s algorithms disagree only with Isabelle. This error 
is caused by a mismatch between the assumptions of the 
parser that builds the pushdown system and the data structure 
that stores the pushdown rules. The parser assumes that it can 
incrementally add rules to the data structure without knowing 
all labels in advance, but the data structure assumes to know 
all labels from the start to implement a memory optimization 
that replaces a rule that applies to all labels by a wildcard. 

For the first two test phases, the program that generated 
Isabelle definitions also depended on this parser, so the bug 
was not discovered until the third phase, which has a different 
setup. After the three bugs were fixed, all test cases pass. 


VII. CONCLUSION 


We presented a methodology that increases the reliability of 
tools and libraries for pushdown reachability analysis. To this 


end, we formalized and proved in Isabelle/HOL the correctness 
of the essential saturation algorithms used in such tools. 
We extracted an executable program from our formalization 
and used it as a trustworthy oracle for differential testing. 
Putting the modern pushdown analysis library PDAAAL on the 
testbed, we discovered a number of implementation errors in 
its code, even though the library performed flawlessly in its ap- 
plication domain. Using our automatic counter-example min- 
imization based on delta-debugging, we were able to identify 
the sources of these errors and suggested fixes to PDAAAL’s 
implementation that now passes all the differential tests. 
This process significantly increased PDAAAL’s reliability 
and shows that with a moderate effort, the combination of 
proof assistants with code generation, differential testing, 
and delta-debugging is highly fruitful. The execution of all 
tests in the three phases took 303 CPU days. We executed 
the tests on a compute cluster with 1536 CPU cores. 
The formalization work took about two person-months for 
experienced formalizers, creating about 4400 nonempty lines 
of Isabelle definitions and proofs. An additional half person-month 
of work was needed to implement the differential 
testing and counter-example minimization, set up the tests, 
and localize and resolve the discovered errors. This one-time 
effort will also benefit the future development of PDAAAL. 
Too often, the race for better performance can lead to subtle 
implementation errors. Our methodology shows how formally 
verified algorithms that were not tuned for performance can be 
used to improve the quality of tuned but unverified algorithms. 
Abstract—TRICERA is an automated, open-source verification 
tool for C programs based on the concept of Constrained Horn 
Clauses (CHCs). In order to handle programs operating on heap, 
TRICERA applies a novel theory of heaps, which enables the tool 
to hand off most of the required heap reasoning directly to the 
underlying CHC solver. This leads to a cleaner interface between 
the language-specific verification front-end and the language- 
independent CHC back-end, and enables verification tools for 
different programming languages to share a common heap back- 
end. The paper introduces TRICERA, gives an overview of the 
theory of heaps, and presents preliminary experimental results 
using SV-COMP benchmarks. 


I. INTRODUCTION 


This paper presents TRICERA, an automated open-source 
verification tool for C programs. TRICERA accepts programs 
in a subset of the C11 standard [1] with the purpose of 
checking whether explicit and implicit safety assertions in 
a program are valid. The tool has been developed mainly 
with applications in the embedded systems area in mind: 
restrictions in the supported language features are aligned 
with the recommendations made in the MISRA C coding 
guidelines [2]. TRICERA works by translating C programs 
to sets of Constrained Horn Clauses (CHCs), which are then 
processed and solved by the CHC solvers ELDARICA [33] or 
SPACER [37], thus either proving that assertions can never fail, 
or computing counterexample traces leading to an assertion 
violation. 

TRICERA is a model checker for C programs, but includes 
a plethora of additional features that go beyond C11, such 
as processing specifications in the ACSL language [6], mod- 
elling concurrent and parameterised systems, and augmenting 
programs with timing constraints. A distinguishing feature of 
TRICERA is the handling of heap data-structures, which are 
among the most challenging aspects in the verification of im- 
perative programs. Existing verification tools based on CHCs 
tend to handle heap either using the theory of arrays (e.g., as 
done by SEAHORN [30]), or apply bespoke encodings of heap 
data using refinement types [28], invariants (JAYHORN [35]) 
or prophecies (RUSTHORN [43]). As the heap encoder is 
often one of the most complex components of a CHC-based 
verification tool, this implies repeated implementation effort 
when designing verification tools for different programming 
languages, and migrating a tool to a different style of heap 
encoding is an extremely complex task. 
We propose a departure from this conventional architecture 
of CHC-based verification tools, instead using a language- 
independent theory of heaps [24] augmenting the interface 
between verification tools and CHC solvers. The theory of 
heaps is designed to cover the features of many existing 
programming languages; it is deliberately kept simple, so that 
it can be integrated easily in verifiers; and it is kept high-level, 
so that CHC solvers are able to implement a wide range of 
methods for solving problems involving heap, including the 
aforementioned encodings through arrays and invariants. The 
resulting architecture is shown in Figure 1. 

TRICERA is the first verification tool that produces CHCs 
modulo the theory of heaps. At the time of writing, 
a project is also underway to convert the Java 
verification tool JAYHORN [35] to use the theory. The development 
of effective solvers for CHCs modulo heaps is an 
ongoing effort as well; currently the CHC solver ELDARICA 
provides direct support for CHCs modulo heaps by integrating 
a native decision and interpolation procedure for the theory 
of heaps [23]. In addition a tool is available for translating 
CHCs with heaps to CHCs with algebraic data-types (ADTs) 
and arrays. A more detailed description of the theory of heaps 
is available as a technical report [24]. 

TRICERA is developed at Uppsala University and the University 
of Regensburg. It is open source and distributed under 
a 3-Clause BSD license. A web interface to try it online is 
available. 

The contributions of this paper are (i) a presentation of 
the verification tool TRICERA, including an overview of its 
features, the verification approach, and architecture; (ii) a 
definition of a high-level encoding of heap data using the 
theory of heaps; (iii) an experimental evaluation of TRICERA, 
on C benchmarks taken from SV-COMP, with and without 
heap. 


II. TRICERA FEATURES 


A. Input Language 


We start with an overview of the features and languages 
supported by TRICERA. As its main input language TRICERA 
can handle a large subset of C11 [1], extended with additional 
features that are useful for verification purposes. An overview 
Fig. 1: Program verification using the theory of heaps. 


of the currently supported and unsupported types, operations 
and constructs is given in Table I. 

The initially supported subset of C11 was selected to 
provide a strong foundation that can be easily extended. Our 
choice of language features to support is mainly influenced 
by safety-critical programs from the embedded systems area, 
and largely aligned with the recommendations made by 
MISRA C [2]. 

TRICERA supports most of the standard C types with the 
exception of floating-point numbers, function pointers and 
strings. Integer types can be treated either as mathematical 
integers or as bit-vectors, in the latter case modelling the stan- 
dard wrap-around semantics. TRICERA has full support for 
the C operators and statements, including several extensions 
discussed later in this section. TRICERA can also handle a 
(small) part of the C library, in particular functions for memory 
allocation. Heap data, pointers, and arrays are encoded through 
the theory of heaps, which we discuss in more detail in 
Section V. 
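
The difference between the two integer semantics can be illustrated with a short Python sketch. It assumes 32-bit two's-complement integers, and wrap_signed is a hypothetical helper written for this illustration; it does not show how TRICERA encodes bit-vectors internally.

```python
def wrap_signed(value, bits=32):
    """Reduce a mathematical integer to a signed two's-complement value."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the low `bits` bits
    if value >> (bits - 1):            # sign bit set: the value is negative
        value -= 1 << bits
    return value


INT_MAX = 2**31 - 1
print(INT_MAX + 1)               # mathematical integers: → 2147483648
print(wrap_signed(INT_MAX + 1))  # wrap-around semantics: → -2147483648
```

An assertion such as x + 1 > x therefore holds for every mathematical integer but can fail under the wrap-around reading.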

The partial support in Table I for arrays refers to the 
following restrictions: (i) the type of array cells must be 
specified during allocation using malloc; (ii) pointers used 
to index arrays (either through brackets or pointer arithmetic) 
must be declared as arrays at their first declaration. For 
instance, int *a cannot be used to index an array, but 
int a[] can. Pointers to array cells are allowed: for instance 
int *b = &a[i] is allowed where a is an int array, but 
b cannot later be indexed as an array. Since arrays are a recent 
addition to TRICERA, these are restrictions of the current 
TriCera C front-end and not a theoretical limitation of the 


TABLE I: Supported subset of the C language (not exhaustive). 
✓ represents fully or almost-fully supported, ~ represents 
partially supported, and ✗ represents unsupported features. 


Types        ✓ integers (mathematical, machine arithmetic), ✓ structs, 
             ✓ enums, ✓ heap pointers, ~ arrays, ~ stack pointers, 
             ✗ floating point, ✗ strings, ✗ function pointers 
Expressions  ✓ (postfix, unary, logical, bitwise, arithmetic, cast operators) 
Statements   ✓ (compound, expression, selection, iteration statements), 
and Blocks   ✓ (atomic, within and thread blocks (non-standard C)) 
Other        ✓ (assert and assume statements), ✓ (malloc, calloc, and free), 
             ✓ threads, communicating timed systems, 
             ✓ function contract and loop invariant inference, 
             ~ ACSL parser (only for function contracts) 


theory of heaps. 

TRICERA has limited support for pointers to stack variables. 
Such pointers are statically associated with the variables they 
point to on the stack. This imposes some restrictions on such 
pointers: they cannot be mixed and matched with pointers to 
the heap, and they cannot be reassigned. These restrictions result 
in encodings that are easier to solve. 


The following paragraphs survey some of the additional 
features beyond C11. 


B. Supported Code Annotations 


In line with other model checkers, TRICERA uses assert 
and assume statements for explicitly specifying properties, 
which have their usual semantics as given by Flanagan and 
Saxe [27]. TRICERA in addition automatically adds several 
implicit properties: 

• all pointer de-references are checked for type safety, 
• array accesses are checked to be within array bounds, 
• (optionally) memory leaks are detected by ensuring all 
  allocated memory on the heap is freed at program exit. 


Checking pointer de-references for type safety also implies 
memory safety, because TRICERA encodes unallocated loca- 
tions using a special type. More information is provided in 
Section V. 

Given a program with an entry function (default main), 
TRICERA will attempt to prove that none of the explicit and 
implicit properties can be violated. When TRICERA reports 
that an assertion is reachable, a counterexample trace is 
provided for debugging purposes. 

TRICERA supports the declaration of non-deterministically 
initialised (local or global) variables (with program type T) 
using the notation T x = _. 

Function calls in a program are handled, by default, 
through inlining; by annotating a function with the comment 
/*@ contract @*/, TRICERA can be instructed to instead 
compute a contract consisting of a pre- and post-condition for 
the function (also see Section II-C). Functions that do not have 
a body are assumed to produce some non-deterministic result, 
but not change global variables or heap data. 

Function contracts can optionally be specified using the 
ACSL specification language [6]. At the moment, TRICERA 
can parse and encode requires, ensures and assigns 
clauses. Listing 1 shows an example program that TRICERA 
can check. Programs annotated with contracts are verified 
modularly: for each function f with a contract, TRICERA will 
try to prove that f will never violate its contract, which is then 
used to encode f at its call sites. More details about the 
supported ACSL features are given in [21]. 


C. Annotation Inference 


TRICERA can be used to automatically infer function con- 
tracts and loop invariants for safe programs (with respect to 
implicit and explicit assertions) [4]. An example program is 
given in Listing 2, encoding the tak function [44]. Based 
on the properties assumed and asserted at lines 12 and 14, 
Listing 1: Example of ACSL function contracts in TRICERA. 
The program is unsafe because q is accessed but only p is 
specified in the assigns clause. 


/*@ 
  requires \valid(p, q); 
  assigns *p; 
*/ 
void foo(int* p, int* q) { 
  *q = 42; 
} 


Listing 2: An example contract inference problem in TRICERA 
1 /+@ contract @*/ @ 


2 int tak(int x, int y, int z) { 
3 if (y < x) 

4 return tak(tak(x-l, y, z), 
5 tak(y-l, z, x), 
6 tak(z-l, x, y)); 
7 else return y; 

8 } 

9 

10 void main() { 

i Rit, 2, OBS 

12 assume(x > y && y <= z); 

13 int r = tak(x, y, zZ); 

14 assert(r == 2); 

15 } 


respectively, TRICERA is able to compute a contract for tak 
that is sufficient to show the safety of the program: 


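$$
\begin{aligned}
f_{pre}  &: \mathit{true}\\
f_{post} &: (r \neq z \lor y \geq z \lor x > y) \land (r \neq y \lor y \geq z \lor y \geq x) \land {}\\
         &\phantom{:}\ (r = z \lor r = y \lor y > z) \land (r = y \lor z > y \lor x > y)
\end{aligned}
$$


where $f_{pre}$ and $f_{post}$ are the pre- and post-conditions of tak. 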

The inferred contracts and invariants can be printed in the 
ACSL language [6], as well as in SMT-LIB2 and in Prolog. 
As of writing this paper, ACSL printing is limited to programs 
without heap. 
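
As a sanity check of the tak example, the property stated in main can be tested by brute force on a small grid of inputs. The sketch below is plain Python and not TRICERA input; it takes the property to be that tak returns z whenever x > y and y <= z. The helper name `property_holds` and the chosen bounds are assumptions made for illustration, and a finite test of this kind is of course no substitute for the inferred contract.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tak(x: int, y: int, z: int) -> int:
    """The tak function of Listing 2, memoised to keep the recursion cheap."""
    if y < x:
        return tak(tak(x - 1, y, z), tak(y - 1, z, x), tak(z - 1, x, y))
    return y


def property_holds(lo: int, hi: int) -> bool:
    """Check 'x > y and y <= z implies tak(x, y, z) == z' for lo <= x, y, z <= hi."""
    values = range(lo, hi + 1)
    for x in values:
        for y in values:
            for z in values:
                if x > y and y <= z and tak(x, y, z) != z:
                    return False
    return True
```

A bounded check like this only covers the sampled inputs, whereas the contract computed by TRICERA covers all integer arguments.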


D. Uninterpreted Predicates 


TRICERA allows declaration of uninterpreted predicates as 
annotations, which can then be used in assert and assume 
statements. Uninterpreted predicates provide a way to directly 
affect the generated set of CHCs, as assumptions about the 
shape of invariants can be manually specified. A program 
annotated with uninterpreted predicates is considered safe if 
and only if an interpretation of the predicates (in the sense of 
first-order logic) exists such that all assertions hold. 

An example application is given in Figure 2. In the left 
column, an array a is updated in a loop, and the loop at 
line 8 encodes the property $\forall j : 0 \le j < n \rightarrow a[j] = 2j$. 
Although the program is simple, it turns out to be challenging 
for software model checkers, since a universally quantified 
property about the array elements is needed. 

The right column shows a version of the program rewritten 
for verification purposes; the uninterpreted predicate p_a is 
now used to specify a data invariant for the array a. The 
two arguments are selected to correspond to the index, and 
the value residing at that index, respectively. Writes to a are 
replaced with assertions to p_a as in line 6, which asserts that 
the array a contains the value 2*i at index i. Reads from a 
are replaced with assumptions with an additional fresh variable 
in lines 9-10. The program in the right column can be verified 
by TRICERA almost instantaneously. 
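
To make the role of the interpretation concrete, the following Python sketch tests a candidate interpretation of p_a against the two obligations of the rewritten program: every asserted fact p_a(i, 2*i) must hold, and every value v admitted by p_a(i, v) must satisfy the final assertion. The function name `adequate` and the bounded enumeration are assumptions for illustration; a CHC solver searches for such an interpretation symbolically rather than by enumeration.

```python
from typing import Callable

Pred = Callable[[int, int], bool]  # candidate interpretation of p_a(index, value)


def adequate(p_a: Pred, bound: int) -> bool:
    """Check both obligations for all n <= bound and all indices i < n."""
    candidates = range(-2 * bound, 2 * bound + 1)
    for n in range(bound + 1):
        for i in range(n):
            # Write side: assert(p_a(i, 2*i)) must hold.
            if not p_a(i, 2 * i):
                return False
            # Read side: assume(p_a(i, v)) must imply assert(2*i == v).
            for v in candidates:
                if p_a(i, v) and 2 * i != v:
                    return False
    return True
```

For example, the interpretation `lambda i, v: v == 2 * i` passes both checks, whereas the trivial interpretations "always true" and "always false" each fail one of them.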

The encoding in Figure 2 closely corresponds to the en- 
coding of universally quantified properties in [13], and is also 
similar to the invariant encoding of [35], where heap data and 
operations are encoded through data invariants. Uninterpreted 
predicates in TRICERA make it possible to easily experiment 
with encoding tricks of this kind. 


E. Concurrency 


TRICERA has basic support for handling concurrency in 
programs. Static threads, executing concurrently with the 
main program, can be declared using the keyword thread. 
TRICERA currently applies a relatively simple, sequentially 
consistent thread model that is defined in [34]. This support 
for concurrency is mainly intended for modelling purposes, but 
is also useful, e.g., for defining monitors that check temporal 
properties during execution. For instance, the following thread 
asserts that the global variable x will never decrease during 
program execution. 


thread Monitor {
  int t = x;
  assert(x >= t);
}


Thread interleaving can be controlled using atomic 
blocks, which mandate that all statements in the block are 
executed in one atomic step. Threads can moreover be con- 
trolled using synchronous rendezvous, which are introduced 
through UPPAAL-style binary communication channels [7]. In 
the following program, the two statements chan_send and 
chan_receive can only be executed together, thus ensuring 
that the assertion will be checked after the assignment: 


chan s; int x;
thread A { x = 42; chan_send(s); }
thread B { chan_receive(s); assert(x > 0); }


Finally, TRICERA also supports the declaration of infinitely 
replicated threads, which are useful to model dynamic thread 
creation and parameterised systems. An example of a parameterised 
model is given in the next section. 


 1                                  1 /*$ p_a(int, int) $*/
 2 void main() {                    2 void main() {
 3   int i, n = _;                  3   int i, n = _;
 4   int a[n];                      4
 5   for (i = 0; i < n; ++i) {      5   for (i = 0; i < n; ++i) {
 6     a[i] = 2*i;                  6     assert(p_a(i, 2*i));
 7   }                              7   }
 8   for (i = 0; i < n; ++i) {      8   for (i = 0; i < n; ++i) {
 9                                  9     int v = _;
10                                 10     assume(p_a(i, v));
11     assert(a[i] == 2*i);        11     assert(2*i == v);
12   }                             12   }
13 }                               13 }


Fig. 2: Encoding an array program (left column) using uninterpreted predicates (right column). 


Listing 3: The parameterised Fischer protocol [3] 

 1 int lock = 0;
 2 thread[tid] Proc {
 3   clock C;
 4   assume(tid > 0);
 5
 6   while (1) {
 7     atomic { assume(lock == 0); C = 0; }
 8     within (C <= 1) { lock = tid; }
 9     C = 0; assume(C > 1);
10
11     if (lock == tid) { // critical sect.
12       assert(lock == tid);
13       lock = 0;
14     }
15   }
16 }

F. Timing Constraints 


For modelling purposes, TRICERA supports timing con- 
straints in C programs. C programs with time have semantics 
similar to UPPAAL timed automata [7], which means that 
computations (program instructions) consume zero time, but 
are interleaved with explicit time-elapse transitions. The pass- 
ing of time can be observed using clocks, which are declared 
as variables of type clock, can be reset to 0, and can be 
compared with constants in assert and assume statements. 

As an example, Listing 3 shows a parameterised version 
of the well-known Fischer mutual exclusion protocol [3]. An 
arbitrary number of processes can participate in the protocol by 
communicating through a shared variable lock. For this purpose, 
an infinitely replicated thread Proc is declared in line 2. 
Each instance of Proc has a unique thread id tid of type 
int and a clock C. Each process executes a simple loop: it 
waits until it observes that lock == 0, and then writes its 
thread id to lock. The within block in line 8 has semantics 
similar to a UPPAAL time invariant: it enforces execution 
of the assignment before the condition C <= 1 has become 
false, i.e., at most one time unit after executing the block in 
line 7. The process then waits for more than one time unit in 
line 9, and then checks that no other process has meanwhile 
overwritten the value in lock. Line 12 asserts that at most 
one process is able to enter the critical section at a time. 

TRICERA is able to verify the safety of this model for an 
unbounded number of participating threads, using an encoding 
of the program as CHCs over k-indexed invariants [34]. 


III. THE TRICERA VERIFICATION APPROACH 
A. Constrained Horn Clauses 


TRICERA analyses programs by translating them to sets 
of Constrained Horn Clauses (CHCs, or just clauses in this 
paper), in such a way that the CHCs are satisfiable iff the 
program is safe. A Constrained Horn Clause is a sentence 
$\forall \bar{x}.\ (C \land B_1 \land \ldots \land B_n \rightarrow H)$ where $H$ is either an atom 
(application of a predicate to first-order terms) or $\mathit{false}$, $B_i$ 
(for $1 \le i \le n$) is an atom, and $C$ is a constraint over 
some background theories (including heaps). A CHC with at 
least one positive literal (an atom or its negation) is called a 
definite clause, and a CHC with no positive literals is called 
a goal clause (or an assertion clause). In the rest of the paper 
we leave the universal quantification of variables implicit, 
and write the clauses from right to left in the spirit of logic 
programming. 


B. The Architecture of TRICERA 


An overview of the TRICERA architecture is given in 
Figure 3. The preprocessor and the CHC solver are external 
tools; we call the whole toolchain “TRICERA”. 

a) Preprocessor: Input programs are preprocessed in 
order to simplify parsing and encoding. To simplify parsing, 
all typedefs are removed and some language constructs are 
normalised into a standard form. Unused type and function 
declarations are removed; removing unused data-types makes 
modelling the heap simpler as the number of possible types for a 
heap object is reduced. The preprocessor is written as a stand-alone 
tool using the LibTooling³ library in C++. 


Fig. 3: An overview of TRICERA. 

b) Verifier Core: The TRICERA core component is a 
translator from C programs into CHCs, written in Scala. The 
verifier core works by first creating a parse tree of the input 
program, and then translating this tree into a set of CHCs. The 
translator supports the language features given in Table I. 

The CHC encoder also includes a CHC simplifier that post-processes 
the generated CHCs before they are sent to a CHC 
solver. This simplifier attempts to merge the CHCs in order to 
produce a smaller but equisatisfiable set of CHCs. 

c) CHC solver: The resulting set of CHCs is finally 
sent to a CHC solver to check if their conjunction is satisfiable. 
TRICERA primarily uses ELDARICA [33] for this purpose as 
it has native support for the theory of heaps, and can easily be 
integrated as a Scala library; however, the final set of CHCs 
can also be post-processed by eliminating heap operations, 
instead encoding using the theory of arrays, and then be 
checked by other solvers such as Z3/SPACER [37]. 


C. Programs as Constrained Horn Clauses (CHCs) 


The overall translation from sequential programs to CHCs 
applied by TRICERA follows the strategy defined, e.g., in [12], 
[29]. In this setting, linear CHCs are used to model the control- 
flow graph of a program: each node of the graph is interpreted 
to represent the set of possible states at a program location and 
each transition corresponds to a program control instruction. 
Asserted properties add additional sink nodes to the graph, 
whose edges are the negations of those properties. The goal of 
the process is to discover program invariants that are sufficient 
to show that none of the sink nodes is reachable. 

A program is thus encoded in CHCs as follows: 


• An uninterpreted predicate is declared for each program 
  location to represent program invariants: the interpretations 
  of these predicates (provided by the CHC solver 
  when the set of CHCs is satisfiable) correspond to sets of 
  program states that may occur at each location. The arguments 
  to a predicate are all program variables currently in scope, 
  as well as additional terms required in the encoding, for 
  instance terms representing the heap. 

• A definite clause consisting of only a single positive 
  literal is added as program entry, e.g., $P(\ldots) \leftarrow \mathit{true}$. 
  The CHC (1) in Figure 4 is an example. 

• A definite clause is introduced for each program control 
  instruction. These CHCs encode Hoare triples between 
  locations [32]. The set of CHCs can be cyclic (e.g., 
  $\{P_1(\ldots) \leftarrow P_0(\ldots),\ P_0(\ldots) \leftarrow P_1(\ldots)\}$), representing 
  program loops. Guarded control instructions are encoded 
  by adding the guards as constraints. The CHCs (2)-(5) 
  in Figure 4 provide an example. 

• Two clauses are added for each asserted property: a goal 
  clause whose constraint is the negation of the asserted 
  property, and a definite clause whose constraint is the 
  asserted property. The CHCs (6) and (7) in Figure 4 
  provide an example. 

• Functions are encoded either through predicates representing 
  their pre-/post-conditions, or by inlining them. 


³ https://clang.llvm.org/docs/LibTooling.html 


An example encoding is provided in Figure 4. 
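
The clauses of Figure 4 can be explored with a small forward-chaining sketch in Python. It enumerates start values of x in a bounded interval, applies clauses (1)-(7) in order, and reports the chain of derived facts if `false` becomes derivable, which corresponds to a violated assertion. The function name `find_violation` and the bounded enumeration are assumptions for illustration only; a CHC solver such as ELDARICA reasons symbolically and does not enumerate values.

```python
from typing import List, Optional


def find_violation(lo: int, hi: int) -> Optional[List[str]]:
    """Derive `false` from clauses (1)-(7) for some start value in [lo, hi]."""
    for x0 in range(lo, hi + 1):
        trace = [f"P0({x0})"]          # (1) P0(x) <- true
        if x0 > 0:
            trace.append(f"P1({x0})")  # (2) P1(x) <- P0(x) /\ x > 0
            x3 = x0 + 1                # (4) P3(x') <- P1(x) /\ x' = x + 1
        else:
            trace.append(f"P2({x0})")  # (3) P2(x) <- P0(x) /\ x <= 0
            x3 = x0 - 1                # (5) P3(x') <- P2(x) /\ x' = x - 1
        trace.append(f"P3({x3})")
        if x3 <= 0:                    # (7) false <- P3(x) /\ x <= 0
            return trace + ["false"]
        # (6) P4(x) <- P3(x) /\ x > 0 derives P4(x3); nothing follows from it.
    return None
```

Since x is unconstrained at the program entry, the start value 0 already leads to a derivation of `false`, so the clause set of Figure 4 is unsatisfiable and the branching statement is unsafe as written; restricting the entry to positive values removes the derivation.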

The translation of concurrent and timed programs follows 
the calculus defined in [34]. To handle concurrency, TRICERA 
uses a variant of the Owicki-Gries proof rules [47], [34], to 
which explicit variables to represent time and clocks are added. 
The representation of replicated threads uses the k-indexed 
invariants approach [52], [34]. 


IV. THE THEORY OF HEAPS 


One of the most challenging aspects of encoding computer 
programs as CHCs is the encoding of heap-allocated data- 
structures and heap-related operations. One approach to rep- 
resent such data-structures is using the theory of arrays (e.g., 
[36], [17]). This is a natural encoding since a heap can be seen 
as an array of memory locations; however, as the encoding is 
byte-precise, in the context of CHCs it tends to be low-level 
and often yields clauses that are hard to solve. 

An alternative approach is to transform away such data- 
structures with the help of invariants or refinement types (e.g., 
[49], [13], [45], [35], and the example in Section II-D). The 
resulting CHCs tend to be over-approximate (i.e., can lead 
to false positives), even with smart refinement strategies that 
aim at increasing precision. This is because every operation 
that reads, writes, or allocates a heap object is replaced with 
assertions and assumptions about local object invariants, so 
that global program invariants might not be expressible. In 
cases where local invariants are sufficient, however, they can 
enable efficient and modular verification even of challenging 
programs. 

Both approaches leave little design choice with respect to 
handling of heaps to CHC solvers. Dealing with heaps at the 
encoding level also implies repeated effort when designing 
verifiers for different programming languages. 

The vision of the presented line of research is to extend 
CHCs to a standardised interchange format for programs with 
heaps. We apply a high-level theory of heaps [24] that does 
not restrict the way in which CHC solvers approach heap 
reasoning, while covering the main functionality needed for 
program verification: (i) representation of the type system 
associated with heap data; (ii) reading and updating of data 
on the heap; (iii) object allocation. 


/* P0 */
if (x > 0)
  x += 1;    /* P1 */
else
  x -= 1;    /* P2 */
/* P3 */
assert(x > 0);
/* P4 */

$$
\begin{aligned}
P_0(x) &\leftarrow \mathit{true} && (1)\\
P_1(x) &\leftarrow P_0(x) \land x > 0 && (2)\\
P_2(x) &\leftarrow P_0(x) \land x \le 0 && (3)\\
P_3(x') &\leftarrow P_1(x) \land x' = x + 1 && (4)\\
P_3(x') &\leftarrow P_2(x) \land x' = x - 1 && (5)\\
P_4(x) &\leftarrow P_3(x) \land x > 0 && (6)\\
\mathit{false} &\leftarrow P_3(x) \land x \le 0 && (7)
\end{aligned}
$$

Fig. 4: The CHC encoding of a branching statement 


Listing 4: SMT-LIB-style declaration of a heap. In lines 4-8 
the constructors and the selectors of the data-types are 
declared. The constructors and selectors in lines 5-6 serve 
as the wrappers and the getters for the program types Node 
and int. Node is encoded as an ADT (line 4) and the C type 
int is encoded using mathematical integers (Int). 

1 (declare-heap
2   Heap Addr Object O_Empty
3   ((Node 0) (Object 0))
4   (((Node (data Int) (next Addr)))
5    ((O_Node (getNode Node))
6     (O_Int (getInt Int))
7     (O_Uninit_Node) (O_Uninit_Int)
8     (O_Empty))))


The theory of heaps employs algebraic data-types (ADTs), 
as already standardised by SMT-LIB [5], as a flexible way to 
handle (i). The theory offers operations akin to the theory 
of arrays to handle (ii) and (iii). Arithmetic operations on 
pointers are excluded in the theory, as are low-level tricks 
like extracting individual bytes from bigger pieces of data 
through pointer manipulation. Being language-agnostic, the 
theory of heaps allows for common encodings across different 
applications. 

a) Sorts: To encode a program using the theory of heaps, 
first a heap data-type has to be declared that covers the 
required program types; a declaration in SMT-LIB notation 
is shown in Listing 4. Each declared heap introduces the 
three sorts, Heap, Addr and AddrRange, and in addition can 
declare any number of ADTs later used to represent the data 
stored on the heap (lines 5-8, see Section V). A Heap address 
has the sort Addr. Although an address itself does not carry 
type information, the type of a heap Object can be checked 
using ADT discriminator functions. A range of addresses can 
be defined with the AddrRange sort, which is needed when 
encoding contiguous data-structures such as arrays. 

The objects on the heap are represented with a single Object 
sort, which can either be selected from one of the pre-declared 
sorts, or declared as an ADT in a heap theory declaration. The 
latter makes referring to heap theory sorts possible, such as 
Addr, as done in line 4 of Listing 4. In the sequel we call a 
constructor function that produces an Object a wrapper, and 
a selector that returns the underlying term from an Object a 
getter. 
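
The wrapper and getter terminology can be illustrated with a small Python model of the data-types in Listing 4. Each wrapper is a frozen dataclass whose single field plays the role of the getter, and the discriminator is an `isinstance` test. Modelling Addr as a plain integer and the helper name `is_O_Node` are assumptions of this sketch, not part of the theory.

```python
from dataclasses import dataclass
from typing import Union

Addr = int  # assumption: addresses are modelled as plain integers


@dataclass(frozen=True)
class Node:
    """ADT declared in line 4 of Listing 4."""
    data: int
    next: Addr


@dataclass(frozen=True)
class O_Node:
    """Wrapper for Node; the field acts as the getter getNode."""
    getNode: Node


@dataclass(frozen=True)
class O_Int:
    """Wrapper for int; the field acts as the getter getInt."""
    getInt: int


@dataclass(frozen=True)
class O_Uninit_Node:
    """Uninitialised object for Node cells."""


@dataclass(frozen=True)
class O_Uninit_Int:
    """Uninitialised object for int cells."""


@dataclass(frozen=True)
class O_Empty:
    """Default object of the heap."""


Object = Union[O_Node, O_Int, O_Uninit_Node, O_Uninit_Int, O_Empty]


def is_O_Node(obj: Object) -> bool:
    """Discriminator: does the Object wrap a Node?"""
    return isinstance(obj, O_Node)
```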

b) Operations: The operations of the theory of heaps are 
given in Table II. The function allocate is used for allocating 
new objects on the heap, and each allocation returns a new 
(Heap, Addr) pair that is valid and contains the passed object. 
The allocatedness of an Addr in a Heap can be tested using 
the predicate valid. The function emptyHeap returns a heap 
that is invalid at all addresses, and nullAddr returns an address 
that is invalid in all heaps. 

The functions read and write are used for reading from 
and writing to heap addresses. If a read address is invalid, the 
default object is returned (O_Empty in line 2 in Listing 4). An 
invalid write returns the heap that was passed to the function 
without any modifications. The default Object to be returned 
on invalid reads is specified in the heap declaration, and this 
is needed to make the read function total. 
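
The behaviour of these operations can be mimicked by a minimal executable model. The sketch below represents a heap as an immutable Python tuple and addresses as positive integers, with 0 standing for nullAddr; this representation and the function names are assumptions chosen for illustration and say nothing about how PRINCESS or ELDARICA implement the theory.

```python
from typing import Tuple

Addr = int
Heap = Tuple[object, ...]      # assumption: heap[k] is the object at address k + 1
NULL_ADDR: Addr = 0            # nullAddr: invalid in every heap
DEFAULT_OBJECT = "O_Empty"     # default object named in the heap declaration


def empty_heap() -> Heap:
    """emptyHeap: a heap that is invalid at all addresses."""
    return ()


def valid(h: Heap, a: Addr) -> bool:
    """valid: is address a allocated in heap h?"""
    return 1 <= a <= len(h)


def allocate(h: Heap, obj: object) -> Tuple[Heap, Addr]:
    """allocate: return the extended heap and the fresh address holding obj."""
    new_heap = h + (obj,)
    return new_heap, len(new_heap)


def read(h: Heap, a: Addr) -> object:
    """read: total function; invalid reads yield the default object."""
    return h[a - 1] if valid(h, a) else DEFAULT_OBJECT


def write(h: Heap, a: Addr, obj: object) -> Heap:
    """write: invalid writes return the heap unchanged."""
    if not valid(h, a):
        return h
    return h[:a - 1] + (obj,) + h[a:]
```

Because heaps are values in this model, `write` and `allocate` return new heaps and leave their argument untouched, mirroring the functional style of the theory.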

Operations (14)-(17) are used for batch heap operations, 
which are needed when encoding array-like data on the heap. 
These operations operate over address ranges rather than 
single addresses (Addr). The functions batchAllocate and 
batchWrite allow batch allocation and batch update of ad- 
dress ranges. Given an address range, nthInAddrRange allows 
the extraction of an individual address, and the predicate 
withinAddrRange allows testing if an address is within a range. 
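
A similar sketch can illustrate the batch operations. Here an address range is modelled as a pair of start address and size, which is an assumption of this example. The source does not state what nthInAddrRange returns for an index outside the range or how batchWrite treats a range that is not allocated; returning the null address (0) and leaving the heap unchanged, respectively, are further assumptions.

```python
from typing import Tuple

Heap = Tuple[object, ...]     # assumption: heap[k] is the object at address k + 1
AddrRange = Tuple[int, int]   # assumption: (start address, number of cells)


def batch_allocate(h: Heap, obj: object, n: int) -> Tuple[Heap, AddrRange]:
    """batchAllocate: allocate n consecutive cells, each holding obj."""
    start = len(h) + 1
    return h + (obj,) * n, (start, n)


def nth_in_addr_range(r: AddrRange, i: int) -> int:
    """nthInAddrRange: the i-th address of r (0 if i is out of range, by assumption)."""
    start, size = r
    return start + i if 0 <= i < size else 0


def within_addr_range(r: AddrRange, a: int) -> bool:
    """withinAddrRange: does address a belong to range r?"""
    start, size = r
    return start <= a < start + size


def batch_write(h: Heap, r: AddrRange, obj: object) -> Heap:
    """batchWrite: overwrite every cell of r with obj."""
    start, size = r
    if start < 1 or start + size - 1 > len(h):
        return h                      # range not fully allocated: no change
    return h[:start - 1] + (obj,) * size + h[start - 1 + size:]
```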

c) Implementation: The theory of heaps is currently implemented 
in the SMT solver PRINCESS [50] and in the CHC 
solver ELDARICA [33]. The decision procedure for solving 
formulas over the theory in PRINCESS is introduced in [23]. 
ELDARICA mostly defers the solving of heap theory formulas 
to PRINCESS; there is ongoing work to implement additional 
static analysis of heap properties directly in ELDARICA. 


TABLE II: Operations defined by the theory of heaps 

emptyHeap       : () → Heap                                (8)
nullAddr        : () → Addr                                (9)
allocate        : Heap × Object → Heap × Addr              (10)
valid           : Heap × Addr → Bool                       (11)
read            : Heap × Addr → Object                     (12)
write           : Heap × Addr × Object → Heap              (13)
batchAllocate   : Heap × Object × ℕ → Heap × AddrRange     (14)
batchWrite      : Heap × AddrRange × Object → Heap         (15)
nthInAddrRange  : AddrRange × ℕ → Addr                     (16)
withinAddrRange : AddrRange × Addr → Bool                  (17)


TABLE III: O_T is the object wrapper for sort T, which is the 
encoding of the program type T. O_T(T_0) constructs a zero-valued 
term of sort T. h represents the heap. Non-primed and 
primed terms encode the same program variable (and the heap) 
before and after the execution of a statement. x is a variable, 
p is a (non-array) pointer. a and b are pointers to arrays of 
type T. i, j and n are integers. 

C statement                  Mathematical encoding with heaps

*p = x;                      h' = write(h, p, O_T(x)) ∧
                             (is-O_Uninit_T(read(h, p)) ∨ is-O_T(read(h, p)))

x = *p;                      x' = getT(read(h, p)) ∧ is-O_T(read(h, p))

x = a[i];                    x' = getT(read(h, nthInAddrRange(a, i))) ∧
                             is-O_T(read(h, nthInAddrRange(a, i))) ∧
                             withinAddrRange(a, i)

a[i] = x;                    h' = write(h, nthInAddrRange(a, i), O_T(x)) ∧
                             (is-O_Uninit_T(read(h, nthInAddrRange(a, i))) ∨
                              is-O_T(read(h, nthInAddrRange(a, i)))) ∧
                             withinAddrRange(a, i)

p = malloc(sizeof(T));       (h', p) = allocate(h, O_Uninit_T)

p = calloc(sizeof(T));       (h', p) = allocate(h, O_T(T_0))

a = malloc(sizeof(T) * n);   (h', p) = batchAllocate(h, O_Uninit_T, n)

free(p);                     h' = write(h, p, O_Empty) ∧
                             ¬is-O_Empty(read(h, p))

free(a);                     h' = batchWrite(h, a, O_Empty) ∧
                             ∀q : Addr. (withinAddrRange(a, q) →
                                         ¬is-O_Empty(read(h, q)))



V. ENCODING OF C PROGRAMS WITH HEAP 


When translating programs with heaps, TRICERA augments 
all introduced relation symbols (state invariants and pre- 
conditions) with explicit Heap arguments; post-conditions 
receive both the pre- and the post-heap. 

A heap Addr can be seen as a direct counterpart of an 
(untyped) C pointer. Any program type that makes use of an 
Addr, such as a list node, needs to be declared as part of the 
heap theory declaration. Lastly, Object wrappers and getters 
need to be declared for all program types that can be on the 
heap. For instance, Listing 4 shows a heap declaration for a 
program over (mathematical) integers and a node struct: 


struct Node { 
int data; 
struct Node* next; 


}; 


Since Node has a pointer field, it is declared as an ADT 
as part of the heap declaration as shown in line 4 of Listing 4. 
The object wrappers and getters for all program types 
are declared in lines 5-6. Additional empty object wrappers 
are defined in lines 7-8 to serve as the uninitialised and 
default objects respectively. TRICERA uses the default object 
to mark de-allocated locations as shown in Table III. The 
uninitialised objects are used as initial values for allocated 
memory locations with uninitialised values, as is the case with 
malloc. An uninitialised object constructor is declared for 
each program type on the heap. 

After the heap is declared, every statement that accesses the 
heap is encoded as shown in Table III. A new heap term h’ is 
produced for statements modifying the starting heap term h. 

Each de-reference of a pointer is also coupled with a type- 
safety assertion. In the table, those assertions are conjoined 
with the actual transition relations of the CHCs; as a result, the 
stated formulas describe all correct executions of a statement. 
TRICERA in addition introduces assertions that will detect 
cases in which these conditions are violated. For instance, 
for the expression *p, assuming that p is encoded using the 
sort T, TRICERA asserts the predicate is-O_T(read(h, p)), where 
is-O_T is the discriminator predicate for the ADT sort T. Since 
invalid reads would return the default object, this type-safety 
assertion doubles as a memory safety assertion. C also allows 
the allocation of uninitialised memory; TRICERA models this 
by placing the object O_Uninit_T in these addresses, which 
represents an uninitialised value for the sort T. 
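
The following Python sketch illustrates why a single discriminator test covers both type safety and memory safety. The heap is modelled as a dictionary from addresses to (constructor name, payload) pairs; this representation and the names `deref_is_safe` and `deref_int` are assumptions for illustration. Reading an unallocated address yields the default object, so the test for the expected constructor fails for unallocated, uninitialised and wrongly typed cells alike.

```python
from typing import Dict, Optional, Tuple

Obj = Tuple[str, Optional[int]]   # assumption: (constructor name, payload)
Heap = Dict[int, Obj]
O_EMPTY: Obj = ("O_Empty", None)  # default object returned by invalid reads


def read(h: Heap, p: int) -> Obj:
    """Total read: unallocated addresses yield the default object."""
    return h.get(p, O_EMPTY)


def deref_is_safe(h: Heap, p: int) -> bool:
    """The implicit assertion is-O_Int(read(h, p)) for an int pointer p."""
    return read(h, p)[0] == "O_Int"


def deref_int(h: Heap, p: int) -> int:
    """Model of `x = *p`: check the assertion, then apply the getter."""
    if not deref_is_safe(h, p):
        raise AssertionError(f"unsafe dereference of address {p}")
    payload = read(h, p)[1]
    assert payload is not None    # O_Int objects always carry a payload
    return payload
```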

Functions that require byte-level access to data-structures 
such as memset are currently not supported by TRICERA; 
however, these can be handled without introducing a full 
byte-level memory representation. It is sufficient to infer 
which values a heap object can assume when setting all its 
bytes to a certain value, taking into account the compiler 
and architecture when necessary. To prevent aliasing when 
using such functions, safety assertions that ensure the accessed 
memory region belongs to a single object can be automatically 
added to each access. 


a) Arrays: TRICERA uses the theory of heaps also to 
model C arrays. Arrays are allocated and freed using the 
batch operations of the theory. The address of an array cell 
is obtained with the nthInAddrRange function, which can 
then be used as any other address. Whenever an array cell 
is accessed (a[i]), TRICERA automatically asserts that the 
accessed index is within bounds (withinAddrRange(a, i)). 

Arithmetic operations on array pointers can be supported 
by augmenting AddrRange terms with offsets (not shown in 
Table II). This yields a model in which arithmetic on array 
pointers is possible, but modified pointers have to remain in 
the same array, which is again in line with the MISRA C 
coding guidelines [2]. 

The theory of heaps does not provide a direct operation for 
de-allocation, i.e., an allocated address always remains valid. 
TRICERA overcomes this limitation by writing the default 
object (O_Empty) to de-allocated addresses, and provides an 
option to add a memory safety assertion such that all addresses 
must contain the default object at program exit. Double-freeing 
of memory is caught by an additional assertion that the freed 
addresses do not contain the default object. 
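
A minimal Python sketch of this de-allocation scheme is given below. The dictionary-based heap and the names `free` and `no_leaks` are assumptions for illustration; the sketch only mirrors the idea of writing the default object on free, rejecting a second free of the same address, and optionally checking at exit that every address holds the default object.

```python
from typing import Dict

O_EMPTY = "O_Empty"   # default object, also used to mark freed cells


def free(h: Dict[int, str], p: int) -> Dict[int, str]:
    """Model of free(p): assert the cell is live, then write the default object."""
    if h.get(p, O_EMPTY) == O_EMPTY:
        raise AssertionError(f"invalid or double free of address {p}")
    new_heap = dict(h)
    new_heap[p] = O_EMPTY
    return new_heap


def no_leaks(h: Dict[int, str]) -> bool:
    """Optional exit check: every address must contain the default object."""
    return all(obj == O_EMPTY for obj in h.values())
```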

Stack-allocated arrays are also modelled using the theory of 
heaps, and the functions to free their memory are automatically 
added by TRICERA when they go out of scope. Non-array 
stack pointers do not make use of the theory, and are supported 
with some limitations as described in [22]. 


[Figure 5 diagram: Source files (C) → TriCera → Heap (CHCs) → heap2array → Array (CHCs)] 

Fig. 5: The three sets of heap benchmarks: (i) the source 
C benchmarks from SV-COMP are encoded into (ii) CHCs 
modulo the theory of heaps using TRICERA, then heap2array 
is applied to these benchmarks to produce (iii) CHCs modulo 
the theory of arrays. 

Table III does not show the encoding of field updates for 
record types, for instance p->f = x where p is a pointer to 
a record type with f as one of its fields. This is encoded by 
first reading the record from p, creating a new ADT term with 
only the field f updated, and then writing back the new ADT 
to the address pointed to by p. 
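
The read-modify-write pattern for field updates can be sketched in Python with an immutable record type. The dictionary-based heap, the function name `update_data` and the use of `dataclasses.replace` are choices made for this illustration only.

```python
from dataclasses import dataclass, replace
from typing import Dict


@dataclass(frozen=True)
class Node:
    """Immutable record mirroring struct Node (data, next)."""
    data: int
    next: int


def update_data(h: Dict[int, Node], p: int, x: int) -> Dict[int, Node]:
    """Model of `p->data = x`: read the record, rebuild it, write it back."""
    old = h[p]                      # read the record stored at p
    new = replace(old, data=x)      # new ADT term with only `data` changed
    new_heap = dict(h)
    new_heap[p] = new               # write the new term back to p
    return new_heap
```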


VI. EXPERIMENTAL RESULTS 
A. Benchmarks 


As TRICERA does not have a model of pthreads yet, we 
focus in our evaluation on sequential C programs. We col- 
lected C benchmarks from SV-COMP 2022’s ReachSafety and 
MemSafety categories [10], and generated CHCs in the SMT- 
LIB [5] format for all benchmarks that the current version 
of TRICERA could parse and encode (see Section II-A). This 
resulted in 396 heap benchmarks, i.e., benchmarks in which the 
heap is modelled using the theory of heaps (349 in the ReachSafety 
and 128 in the MemSafety categories, with some benchmarks occurring 
in both categories), and 1453 non-heap benchmarks in the 
ReachSafety category. Many of the benchmarks that TRICERA 
could not parse belong to the Juliet test and the Linux 
device driver suites; they failed mainly due to currently 
unsupported operations and constructs such as memcpy and 
function pointers. Mathematical integer semantics was used in 
the benchmarks encoded by TRICERA. 

For the heap benchmarks, an additional set of benchmarks 
was created through a translation of the theory of heaps into 
the theory of arrays, using an extended version of the encoding 
given in [24] implemented in the tool heap2array⁴. This 
serves the purpose of making additional back-ends available 
to solve the generated CHCs. Similarly generated benchmarks 
were also submitted to CHC-Comp 2022 and were part of 
the LIA-nonlin-Arrays-nonrecADT track⁵. The benchmark creation 
process is depicted in Figure 5. 

We then applied two of the top CHC solvers currently 
available [26] to the CHCs: ELDARICA, which is the default 
solver in TRICERA and natively supports the theory of heaps, 
and Z3/SPACER [37]. We used the default settings in 
both ELDARICA and Z3/SPACER. ELDARICA was used in two 
different configurations for the heap benchmarks: TRICERA 
(ELDARICA-heap), using the native solver for the theory of 
heaps on the CHCs with heaps, and TRICERA (ELDARICA-array), 
applying ELDARICA's array solver to the array version 
of the CHCs. Z3/SPACER was only applied to the array 
benchmarks (TRICERA (Z3/SPACER)). The portfolio rows in 
the result tables show the results achieved by running both 
back-ends of TRICERA in parallel and taking the first result 
(TRICERA (portfolio)). 


⁴ https://github.com/zafer-esen/heap2array 
⁵ https://github.com/zafer-esen/tricera-adt-arr 
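
The portfolio idea can be expressed as a simple merge of per-benchmark results. The sketch below is a hypothetical post-processing step over recorded verdicts and run times, not the way the experiments were actually run; the data layout, the function name `portfolio`, and the choice to ignore "unknown" answers when picking the first result are all assumptions.

```python
from typing import Dict, Tuple

Result = Tuple[str, float]   # (verdict, wall-clock seconds); verdict is safe/unsafe/unknown


def portfolio(a: Dict[str, Result], b: Dict[str, Result]) -> Dict[str, str]:
    """For each benchmark, keep the verdict of the back-end that answered first."""
    merged: Dict[str, str] = {}
    for bench in sorted(a.keys() | b.keys()):
        answers = [r for r in (a.get(bench), b.get(bench))
                   if r is not None and r[0] != "unknown"]
        if answers:
            merged[bench] = min(answers, key=lambda r: r[1])[0]
        else:
            merged[bench] = "unknown"
    return merged
```

With this merge, a benchmark counts as solved by the portfolio as soon as at least one back-end produces a definite verdict within the timeout.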


B. Experimental Setup 


The experiments were run on an AMD Opteron 2220 SE 
(2.8 GHz with 4 CPUs) machine running 64-bit Linux with 
6 GB of RAM and a wall-clock timeout of 900 seconds. To 
compare TRICERA⁶ against the state of the art, we gathered 
the results published by SV-COMP 2022 [9] for the ReachSafety 
and MemSafety tracks. 


C. Results 


The results are given in Table IV for the non-heap bench- 
marks in the ReachSafety category, in Table V for the heap 
benchmarks in the ReachSafety category and in Table VI 
for the heap benchmarks in the MemSafety category. All 
benchmarks can be found in [25]. 

For non-heap, TRICERA showed performance competitive 
with the best tools evaluated at SV-COMP, in particular 
on safe problems. The TRICERA results are not completely 
comparable to the results of SV-COMP tools due to the use 
of mathematical integer semantics in TRICERA, however. For 
19 benchmarks, the statuses reported by TRICERA were incon- 
sistent with the expected SV-COMP statuses for this reason. 
The two TRICERA back-ends, ELDARICA and Z3/SPACER, 
always produced the same answer. 

For heap problems, TRICERA performed worse than some 
of the tools based on bounded model checking or symbolic 
execution, but was comparable with CEGAR-based tools like 
CPACHECKER. Comparing the TRICERA back-ends, ELD- 
ARICA applied to the array encoding performs best by some 
margin (TRICERA (ELDARICA-array)). 

TRICERA currently cannot check for reachability and 
memory-safety properties separately; it always adds the implicit 
memory-safety assertions. This, coupled with the use 
of mathematical integers, led to results that did not match 
their expected SV-COMP statuses in 25 heap benchmarks 
(13 reported incorrectly unsafe, 12 reported incorrectly safe) 
using the portfolio method; again there were no inconsistencies 
between the different TRICERA back-ends. 


VII. RELATED WORK 


There are several other verification tools that make use 
of CHCs, and many others for verifying C programs. As 
discussed in Section I, these tools either transform away the 
heap, or use the theory of arrays to encode the heap. 

JAYHORN, a model checker for Java programs, encodes 
the heap by using invariants that summarise the possible states 
of a reference at a program location [35], which is inspired 
by methods like liquid types [49]. Although this method is 
incomplete (i.e., can lead to false positives), with various 
optimisations the authors have managed to significantly improve 
its effectiveness. Using the theory of heaps, much of the 
work in JAYHORN could be shifted to a CHC solver. TRICERA 
and JAYHORN both use ELDARICA for solving the generated 
CHCs, but otherwise do not share any infrastructure. 


⁶ https://github.com/uuverifiers/tricera/commit/5ffd2b6 


TABLE IV: Results for the non-heap benchmarks in the 
ReachSafety category. The column “solved” gives the total 
number of “safe” or “unsafe” results. 

                           safe  unsafe  unknown  solved
GOBLINT [51]                180       0     1273     180
THETA [53]                  250     140     1063     390
UKOJAK [46]                 278     221      954     499
VERIFUZZ [15]                 0     515      938     515
2LS [42]                    428     265      760     693
CBMC [38]                   313     394      746     707
TRICERA (Z3/SPACER)         442     271      740     713
CRUX [19]                   293     427      733     720
LART [40]                   346     392      715     738
ESBMC-KIND [41]             484     380      589     864
SYMBIOTIC [14]              423     458      572     881
UTAIPAN [18]                598     298      557     896
UAUTOMIZER [31]             612     302      539     914
PESCO [48]                  584     458      411    1042
TRICERA (ELDARICA)          698     360      395    1058
GRAVES-CPA [41]             636     442      375    1078
TRICERA (portfolio)         730     379      344    1109
CPACHECKER [11]             666     470      317    1136
VERIABS [16]                739     507      207    1246


TABLE V: Results for the heap benchmarks in the ReachSafety 
category. 

                           safe  unsafe  unknown  solved
THETA [53]                   10       7      332      17
GOBLINT [51]                 27       0      322      27
GRAVES-CPA [41]              22      26      301      48
TRICERA (ELDARICA-heap)      12      36      301      48
2LS [42]                     35      21      293      56
TRICERA (Z3/SPACER)          20      40      289      60
UTAIPAN [18]                 32      33      284      65
UAUTOMIZER [31]              32      35      282      67
UKOJAK [46]                  25      42      282      67
VERIFUZZ [15]                 0      71      278      71
TRICERA (ELDARICA-array)     36      49      264      85
TRICERA (portfolio)          39      58      252      97
CRUX [19]                    55      48      246     103
CPACHECKER [11]              58      46      245     104
PESCO [48]                   65      47      237     112
CBMC [38]                    65      51      233     116
LART [40]                    90      30      229     120
SYMBIOTIC [14]              102      62      185     164
ESBMC-KIND [41]             122      49      178     171
VERIABS [16]                223      81       45     304


TABLE VI: Results for the heap benchmarks in the MemSafety 
category. 

                           safe  unsafe  unknown  solved
VERIFUZZ [15]                 0       5      123       5
SESL                          0      11      117      11
UAUTOMIZER [31]               6      11      111      17
UTAIPAN [18]                  7      10      111      17
UKOJAK [46]                   9      10      109      19
2LS [42]                     14      11      103      25
TRICERA (ELDARICA-heap)      12      19       97      31
TRICERA (Z3/SPACER)          23      16       89      39
CPACHECKER [11]              53      10       65      63
TRICERA (ELDARICA-array)     39      24       65      63
TRICERA (portfolio)          40      26       62      66
ESBMC-KIND [41]              54      18       56      72
CPA-BAM-SMG                  53      30       45      83
CBMC [38]                    54      36       38      90
SYMBIOTIC [14]               68      36       24     104


RUSTHORN is a verifier for Rust programs that also transforms 
away the heap [43], exploiting the ownership system 
of Rust. Since the method is not directly applicable to 
unsafe code blocks, a theory of heaps could be used to extend 
the tool in this direction. 

SEAHORN is a verifier for LLVM-based languages [30]. 
SEAHORN employs Z3/SPACER as one of its back-ends for 
CHC-based model-checking. It also employs various static 
analyses that can be used on their own as a verification engine, 
or to provide invariants to its CHC back-ends. SEAHORN 
encodes the heap as a set of non-overlapping arrays that 
are created by a data structure analysis (DSA) [39]. Since 
SEAHORN works with the LLVM intermediate representation, 
it can be used to target LLVM-based languages other than C. 
In contrast, TRICERA comes with its own parser, which currently 
cannot handle all the peculiarities of C; however, this custom 
parser can handle several non-standard C constructs, as shown 
in Table I, and can easily be extended. 
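
The difference between modelling the heap as one array and as a set of non-overlapping arrays can be illustrated with a small sketch. The following Python model is illustrative only: it assumes that a points-to analysis has already established that two pointers never alias, and the names `run_flat` and `run_partitioned` are hypothetical. It reproduces neither the actual encoding used by SEAHORN nor the one used by TRICERA.

```python
# Illustrative model only; all names here are hypothetical.

def run_flat():
    """Model the whole heap as a single array indexed by address."""
    heap = {}        # one array for all of memory
    p, q = 0, 1      # two distinct addresses
    heap[p] = 10
    heap[q] = 20     # a verifier must establish p != q to know
                     # that heap[p] is unaffected by this write
    return heap[p]


def run_partitioned():
    """Model each alias class as its own array (assumed result of a
    data structure analysis; the partition is given, not computed)."""
    region_a = {}    # cells that only p may point to
    region_b = {}    # cells that only q may point to
    p, q = 0, 0      # offsets inside each region; they may coincide
    region_a[p] = 10
    region_b[q] = 20 # cannot affect region_a by construction
    return region_a[p]
```

In the partitioned model, the fact that the second write leaves the first cell untouched holds by construction, so no disequality between addresses has to be inferred.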

KORN is a CHC-based verifier for C programs; however, 
its main focus is to show the feasibility of using loop contracts 
as opposed to loop invariants, and it currently supports only a small 
fragment of C [20]. KORN uses ELDARICA as one of its back-ends. 
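
To make the distinction between loop invariants and loop contracts concrete, the following Python sketch checks both styles of specification at run time for the same simple loop. It is a minimal illustration under stated assumptions: the loop, the function names `loop_with_invariant` and `loop_with_contract`, and the chosen specifications are hypothetical and are not taken from KORN or from [20].

```python
# Illustrative sketch only; all names and specifications are hypothetical.

def loop_with_invariant(n):
    """Invariant style: a property that holds at every loop head,
    starting from the fixed initial state i = 0, x = 0."""
    assert n >= 0
    i, x = 0, 0
    assert x == 2 * i and i <= n        # invariant holds on entry
    while i < n:
        i += 1
        x += 2
        assert x == 2 * i and i <= n    # invariant is preserved
    assert x == 2 * n                   # invariant plus exit condition
    return x


def loop_with_contract(i0, x0, n):
    """Contract style: a pre/post summary of the remaining iterations,
    relating an arbitrary loop-head state (i0, x0) to the exit state."""
    assert i0 <= n                      # precondition of the contract
    i, x = i0, x0
    while i < n:
        i += 1
        x += 2
    assert i == n and x == x0 + 2 * (n - i0)   # postcondition
    return i, x
```

For example, `loop_with_invariant(5)` returns `10`, and `loop_with_contract(3, 7, 5)` returns `(5, 11)`. The contract does not mention the initial values `i = 0` and `x = 0`: it summarises the loop from any state that satisfies its precondition.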

Information about the other verification tools evaluated in 
Section VI can be found in the SV-COMP report [8]. 


VIII. CONCLUSIONS AND OUTLOOK 


This paper has introduced the verification tool TRICERA, 
given an overview of the encoding of C programs using 
the theory of heaps, and provided first experimental results 
using SV-COMP benchmarks. Both TRICERA and the theory 
of heaps are still under development, and planned future 
work includes support for further features of C (see Table I), 
improved decision and interpolation procedures for the theory 
of heaps, and the development of additional heap back-ends 
(in particular along the lines of [35]). Once multiple CHC 
solvers with support for the theory of heaps are available, we 
will also propose a heap track at the Horn solver competition 
CHC-COMP. 